Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection

Khushika Shah; Lei Zhang; Maksuda Bilkis Baby; Naiyue Liang

arxiv: 2605.31520 · v1 · pith:TKNFZWFQnew · submitted 2026-05-29 · 💻 cs.SE · cs.AI · cs.CR

Pith reviewed 2026-06-28 21:20 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR

keywords credential leakage detection · three-class classification · CodeBERT · CNN · false positive reduction · source code security · machine learning · placeholder detection

The pith

A hybrid CNN-CodeBERT model classifies candidate credentials in code as genuine leaks, placeholders or weak credentials, or non-credentials, cutting high-severity alerts by one third.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that binary detection of credential leaks in source code fails because it cannot separate real exposed secrets from placeholder strings or weak values that developers intentionally leave behind. By training a three-class system that treats placeholders as their own category, the approach aims to reduce the flood of high-severity alerts that overwhelms security teams. The model pairs CodeBERT embeddings for code context with a CNN for character-level patterns and is tested on a newly constructed collection of 9,426 labeled examples drawn from ten programming languages. If the results hold, security tools could maintain high recall on actual leaks while removing roughly a third of the high-severity alerts that previously required manual review.

Core claim

On the 9,426-sample dataset the hybrid model reaches a Matthews correlation coefficient of 0.86 and macro F1 of 0.90, delivering 93 percent recall and 89 percent precision on genuine credential leaks, cutting high-severity alerts from 373 to 250, and lifting placeholder-or-weak-credential F1 from 54 percent to 81 percent while preserving coverage across languages.
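The headline numbers are standard multi-class metrics, so they can be recomputed from any confusion matrix. The sketch below is a minimal pure-Python illustration, not the authors' evaluation code: the function names and the convention that rows are true classes and columns are predicted classes are assumptions, and the paper's confusion matrix is not reproduced here. Only the alert counts 373 and 250 come from the source.

```python
import math

def alert_reduction(before, after):
    """Fraction of alerts removed when the count drops from `before` to `after`."""
    return (before - after) / before

def per_class_f1(cm, k):
    """F1 for class k; cm[i][j] counts samples of true class i predicted as j."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)   # column sum
    actual = sum(cm[k])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(cm, k) for k in range(len(cm))) / len(cm)

def multiclass_mcc(cm):
    """Multi-class Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                           # total samples
    c = sum(cm[k][k] for k in range(n))                       # correct predictions
    t = [sum(cm[k]) for k in range(n)]                        # true count per class
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0

# Alert counts reported in the paper: 373 before, 250 after.
print(f"alert reduction: {alert_reduction(373, 250):.1%}")
```

With the reported counts, `alert_reduction(373, 250)` is 123/373, which rounds to the 33.0% stated in the abstract.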

What carries the argument

The hybrid CNN-CodeBERT three-class classifier that combines semantic embeddings from CodeBERT with character-level convolutional pattern detection.
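To make the two-branch idea concrete, here is a toy forward pass in plain Python. It is only a sketch under loud assumptions: the fusion by concatenation, the kernel width, the class names, and every function name are hypothetical; the `context_embedding` argument is a stand-in for a CodeBERT vector (no real CodeBERT model is loaded); and the weights are random, so the output is meaningless until a model is trained. The paper's actual layer sizes and fusion method are not given in the abstract.

```python
import math
import random

CLASSES = ["genuine", "placeholder_or_weak", "non_credential"]  # hypothetical names

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def char_conv_features(text, kernels, width=3):
    """Character-level 1D convolution, ReLU, then global max pooling."""
    codes = [ord(ch) / 127.0 for ch in text] or [0.0]
    codes += [0.0] * max(0, width - len(codes))  # pad strings shorter than the kernel
    features = []
    for kernel in kernels:
        best = 0.0  # starting at 0 applies the ReLU floor
        for i in range(len(codes) - width + 1):
            activation = sum(w * c for w, c in zip(kernel, codes[i:i + width]))
            best = max(best, activation)
        features.append(best)
    return features

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(candidate, context_embedding, kernels, head):
    """Concatenate the context vector with character features, then apply a linear head."""
    fused = list(context_embedding) + char_conv_features(candidate, kernels)
    logits = [sum(w * x for w, x in zip(row, fused)) for row in head]
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))], probs

rng = random.Random(0)
demo_kernels = random_matrix(4, 3, rng)        # 4 filters of width 3
demo_head = random_matrix(3, 8 + 4, rng)       # 3 classes, 8 context dims + 4 char features
print(classify("changeme", [0.1] * 8, demo_kernels, demo_head))
```

The design point the sketch illustrates is that the character branch sees only the candidate string while the context vector summarizes the surrounding code, so a string like `changeme` and a high-entropy token can share the same context yet receive different character features.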

If this is right

Genuine credential leaks are detected at 93 percent recall and 89 percent precision.
High-severity alerts drop by 33 percent from 373 to 250 while security coverage stays intact.
Placeholder and weak credential detection rises from 54 percent to 81 percent F1.
Nine of the ten languages reach F1 above 0.80 under leave-one-language-out testing.
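Leave-one-language-out testing holds out every sample of one language for evaluation and trains on the remaining languages, repeating once per language. A minimal sketch of the fold construction, assuming each sample is a dict with a hypothetical `language` key; the paper's data format and training loop are not described in the abstract.

```python
from collections import defaultdict

def leave_one_language_out(samples):
    """Yield (held_out_language, train, test) once per language."""
    by_language = defaultdict(list)
    for sample in samples:
        by_language[sample["language"]].append(sample)
    for held_out in sorted(by_language):
        # Train on every language except the held-out one.
        train = [
            s
            for language, group in by_language.items()
            if language != held_out
            for s in group
        ]
        test = by_language[held_out]
        yield held_out, train, test

# Tiny illustrative corpus; the real dataset has ten languages.
corpus = [
    {"language": "python", "text": 'password = "changeme"'},
    {"language": "go", "text": 'token := "abc123"'},
    {"language": "java", "text": 'String key = "xyz";'},
]
for language, train, test in leave_one_language_out(corpus):
    print(language, len(train), len(test))
```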

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Security scanning tools could integrate the three-class output to route only genuine-leak candidates to human analysts.
The same separation of intentional placeholders from accidental leaks may generalize to other sensitive string types such as API tokens or database connection strings.
Development environments could adopt the classifier for inline warnings that distinguish real exposures from test values without constant false alarms.
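The routing idea could look roughly like the mapping below. This is an editorial sketch only: the class names, the severity levels, and the `route_alert` and `analyst_queue` helpers are hypothetical and do not come from the paper or from any scanner's API.

```python
# Hypothetical mapping from predicted class to alert severity.
SEVERITY_BY_CLASS = {
    "genuine": "high",              # send to a human analyst
    "placeholder_or_weak": "low",   # log or show as a lint-style hint
    "non_credential": None,         # no alert at all
}

def route_alert(finding):
    """Turn one classified finding into an alert dict, or None if no alert is due."""
    severity = SEVERITY_BY_CLASS[finding["predicted_class"]]
    if severity is None:
        return None
    return {"file": finding["file"], "line": finding["line"], "severity": severity}

def analyst_queue(findings):
    """Keep only the high-severity alerts that need manual review."""
    alerts = (route_alert(f) for f in findings)
    return [a for a in alerts if a is not None and a["severity"] == "high"]

findings = [
    {"file": "app.py", "line": 12, "predicted_class": "genuine"},
    {"file": "README.md", "line": 40, "predicted_class": "placeholder_or_weak"},
    {"file": "util.py", "line": 7, "predicted_class": "non_credential"},
]
print(analyst_queue(findings))
```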

Load-bearing premise

The three-class labels assigned to the 9,426 samples are accurate and free of systematic bias or error across the ten languages.

What would settle it

Have independent reviewers re-label a random 20 percent subset of the dataset, retrain the model, and check whether the reported MCC, F1 scores, and alert reduction stay within five percent of the original figures.
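The bookkeeping for such a check is small. The sketch below assumes "within five percent" means a relative tolerance, which the text does not specify; the helper names, the seed, and the use of integer sample IDs are likewise hypothetical. The retraining step itself is out of scope here.

```python
import random

def sample_for_relabel(sample_ids, fraction=0.2, seed=0):
    """Pick a reproducible random subset of IDs to send to independent reviewers."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    k = round(len(ids) * fraction)
    return sorted(rng.sample(ids, k))

def within_tolerance(original, rerun, rel_tol=0.05):
    """For each metric, report whether the rerun is within rel_tol of the original."""
    return {
        name: abs(rerun[name] - value) <= rel_tol * abs(value)
        for name, value in original.items()
    }

subset = sample_for_relabel(range(9426))
print(len(subset))  # size of the 20 percent subset
```

For a 9,426-sample dataset the subset has 1,885 items. `within_tolerance` would then be called with the reported metrics as `original` and the metrics from the retrained model as `rerun`.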

Figures

Figures reproduced from arXiv: 2605.31520 by Khushika Shah, Lei Zhang, Maksuda Bilkis Baby, Naiyue Liang.

**Figure 1.** Overview of the proposed hybrid credential leakage detection model. A CodeBERT-based semantic encoder models surrounding code context to … (caption truncated in source)

**Figure 2.** Confusion matrix of the proposed model under Seed 42. Out of 943 … (caption truncated in source)

**Figure 3.** Overall performance comparison across methods. Our approach … (caption truncated in source)

Original abstract

Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, achieving 93% recall and 89% precision for genuine credential leaks while reducing high severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The three-class hybrid model improves placeholder detection but the new dataset's labeling process is undocumented, which undercuts the metrics.

The main thing here is a shift from binary secret detection to three classes—genuine leaks, placeholders/weak creds, and neither—using CodeBERT for context plus CNN for patterns. They built a new 9,426-sample set across 10 languages and report MCC 0.86, macro F1 0.90, 93% recall on real leaks, and a 33% drop in high-severity alerts while lifting placeholder F1 from 54% to 81%. The leave-one-language-out results also hold up across most languages.

That framing is useful. Real scanners waste time on things like example passwords, so treating placeholders separately is a direct way to cut noise without losing coverage. The hybrid architecture makes sense for mixing semantic and character signals.

The problem is the data. All the numbers depend on accurate three-class labels, yet the abstract gives zero detail on annotation rules, how ground truth was established, inter-annotator checks, or steps to avoid bias across languages. If 15-20% of the 'genuine' or 'placeholder' labels are off, the claimed gains and alert reduction become unreliable. No training hyperparameters, baseline comparisons beyond vague 'prior character-level' mentions, or statistical tests are described either.

This is for people who maintain or evaluate secret-scanning tools in open-source or enterprise settings. A reader focused on practical false-positive reduction would find the three-class idea worth testing, but only after seeing the full methods section.

Send it for review so the authors can supply the missing labeling protocol and any code or data release. The core problem is worth the effort if the evidence holds.

Referee Report

2 major / 2 minor

Summary. The paper proposes a hybrid CNN-CodeBERT framework for three-class credential leakage detection in source code (genuine leaks vs. placeholders/weak credentials vs. none), evaluated on a newly constructed 9,426-sample dataset spanning 10 languages. It claims MCC of 0.86, macro F1 of 0.90, 93% recall and 89% precision on genuine leaks, 33% reduction in high-severity alerts (373 to 250), and improved placeholder detection F1 from 54% to 81%, with strong leave-one-language-out generalization.

Significance. If the three-class labels prove reliable and representative, the work could meaningfully advance credential detection by explicitly modeling placeholders as a separate class, reducing alert fatigue while preserving coverage. The cross-language evaluation and comparison to prior character-level baselines are strengths that would support practical adoption in security tooling if validated.

major comments (2)

[Abstract and dataset construction] Abstract and dataset description: All headline metrics (MCC 0.86, macro F1 0.90, 93% recall/89% precision on genuine leaks, 33% alert reduction) rest on the correctness of the three-class labels for the 9,426 samples. No annotation protocol, inter-annotator agreement, ground-truth source for secrets, or bias-mitigation steps are provided, so it is impossible to determine whether the reported gains are supported by accurate labels or undermined by mislabeling between genuine and placeholder classes.
[Evaluation] Evaluation and results sections: The claim of reducing high-severity alerts by 33% without sacrificing security coverage, plus the per-class F1 improvements and leave-one-language-out results, cannot be assessed without details on how the test set was sampled, labeled, or validated across the 10 languages. If labeling error exceeds ~15%, both the security-coverage assertion and the cross-language generalization claim become unreliable.
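Inter-annotator agreement of the kind requested in these comments is commonly reported as Cohen's kappa when two annotators label the same items. A minimal sketch under that assumption; the source does not say which agreement statistic, if any, the authors used, and the label strings below are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["genuine", "genuine", "placeholder", "none"]
annotator_2 = ["genuine", "placeholder", "placeholder", "none"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for chance, which matters here because a dominant class can produce high raw agreement even when annotators disagree on most genuine-versus-placeholder cases.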

minor comments (2)

[Abstract] The abstract mentions 'character-level pattern recognition' combined with CodeBERT but provides no diagram or equation for the hybrid architecture; a figure or pseudocode would clarify the CNN integration.
[Results] A table or figure reporting per-language F1 scores under leave-one-language-out evaluation is referenced but not described in sufficient detail for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript insufficiently documents the dataset labeling process and test set construction, which are foundational to interpreting the reported metrics. We will revise the paper to provide these details.

Referee: [Abstract and dataset construction] Abstract and dataset description: All headline metrics (MCC 0.86, macro F1 0.90, 93% recall/89% precision on genuine leaks, 33% alert reduction) rest on the correctness of the three-class labels for the 9,426 samples. No annotation protocol, inter-annotator agreement, ground-truth source for secrets, or bias-mitigation steps are provided, so it is impossible to determine whether the reported gains are supported by accurate labels or undermined by mislabeling between genuine and placeholder classes.

Authors: We acknowledge that the manuscript does not describe the annotation protocol. The 9,426 samples were derived from candidate strings extracted via pattern matching from public GitHub repositories across 10 languages, followed by manual labeling into genuine leaks, placeholders/weak credentials, and none. In the revised version we will add a dedicated Dataset Construction subsection that specifies the labeling guidelines (including decision criteria for distinguishing genuine credentials from placeholders based on context and entropy), the annotator background, any automated pre-filtering steps, and bias-mitigation measures such as stratified sampling by language and repository. We will also state whether multiple annotators were used and, if so, report inter-annotator agreement; if labeling was performed by a single expert, we will describe the validation steps taken. revision: yes
Referee: [Evaluation] Evaluation and results sections: The claim of reducing high-severity alerts by 33% without sacrificing security coverage, plus the per-class F1 improvements and leave-one-language-out results, cannot be assessed without details on how the test set was sampled, labeled, or validated across the 10 languages. If labeling error exceeds ~15%, both the security-coverage assertion and the cross-language generalization claim become unreliable.

Authors: We agree that additional information on test-set sampling and validation is required. The test portion was drawn from the full 9,426-sample corpus with explicit stratification to maintain class and language balance. In the revision we will expand the Evaluation section to detail the exact train/test split ratios, the sampling procedure, how labels on the test set were independently verified, and any sensitivity analysis of the 33% alert-reduction figure under plausible labeling-error rates. We will also include per-language breakdown tables for the leave-one-language-out experiments to further substantiate the generalization claims. revision: yes
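Stratification by class and language, as described in this response, can be sketched as follows. The 10 percent test fraction, the seed, the dict keys `label` and `language`, and the function name are all hypothetical; the authors state that the exact split ratios will only be given in the revision.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.1, seed=0):
    """Split samples so each (label, language) group keeps the same test share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[(sample["label"], sample["language"])].append(sample)
    train, test = [], []
    for key in sorted(strata):       # sorted keys keep the split reproducible
        group = strata[key][:]       # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Illustrative corpus: 2 labels x 2 languages x 10 samples each.
corpus = [
    {"label": label, "language": language, "id": i}
    for label in ("genuine", "placeholder")
    for language in ("python", "go")
    for i in range(10)
]
train, test = stratified_split(corpus, test_fraction=0.2)
print(len(train), len(test))
```

Very small strata can round to zero test samples under this scheme, so a real split would need a rule for rare class and language combinations.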

Circularity Check

0 steps flagged

No circularity; metrics evaluated on independently constructed dataset

The paper reports MCC 0.86 and macro F1 0.90 on a newly constructed 9,426-sample three-class dataset spanning 10 languages, with explicit leave-one-language-out evaluation and comparison to prior methods. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The central results are standard supervised classification metrics on held-out data; label construction is asserted but does not reduce any equation or claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; full text would be required to audit modeling choices or assumptions.

[1] [1]

Automated detection of password leakage from public github repositories,

R. Fenget al., “Automated detection of password leakage from public github repositories,” inProceedings of the International Conference on Software Engineering (ICSE), pp. 175–186, 2022

2022

[2] [2]

IssueGuard: Real-Time Secret Leak Prevention Tool for GitHub Issue Reports

M. N. Rahman, S. Ahmed, Z. Wahab, G. Uddin, and R. Shahriyar, “IssueGuard: Real-time secret leak prevention tool for GitHub issue reports,”arXiv preprint arXiv:2602.08072, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

State of secrets sprawl 2025: The definitive annual report on the state of secrets exposure on github,

GitGuardian, “State of secrets sprawl 2025: The definitive annual report on the state of secrets exposure on github,” tech. rep., GitGuardian, March 2025

2025

[4] [4]

Information about 2016 data security incident,

Uber Technologies Inc., “Information about 2016 data security incident,” 2016

2016

[5] [5]

How bad can it git? characterizing secret leakage in public github repositories,

M. Meliet al., “How bad can it git? characterizing secret leakage in public github repositories,” inProceedings of the Network and Distributed System Security Symposium (NDSS), 2019

2019

[6] [6]

Gitleaks: Detecting hardcoded secrets using pat- tern matching and entropy analysis

Gitleaks Project, “Gitleaks: Detecting hardcoded secrets using pat- tern matching and entropy analysis.” https://github.com/gitleaks/gitleaks, 2024

2024

[7] [7]

TruffleHog: Searching for high-entropy strings and secrets in git repositories

Truffle Security, “TruffleHog: Searching for high-entropy strings and secrets in git repositories.” https://github.com/trufflesecurity/trufflehog, 2024

2024

[8] [8]

A comparative study of software secrets reporting by secret detection tools,

S. K. Basaket al., “A comparative study of software secrets reporting by secret detection tools,” inProceedings of ESEM, pp. 1–12, 2023

2023

[9] [9]

Asleep at the keyboard? assessing the security of ai-generated code,

A. Pearce, L. Li, and K. Sen, “Asleep at the keyboard? assessing the security of ai-generated code,” inProc. of the IEEE Symposium on Security and Privacy, pp. 765–783, 2022

2022

[10] [10]

Lost at c: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, “Lost at c: A user study on the security implications of large language model code assistants,” in32nd USENIX Security Symposium (USENIX Security 23), pp. 2205–2222, 2023

2023

[11] [11]

Static analysis for security,

B. Chess and G. McGraw, “Static analysis for security,”IEEE Security & Privacy, vol. 2, no. 6, pp. 76–79, 2004

2004

[12] [12]

Why secret detection tools are not enough: An industrial case study,

M. R. Rahmanet al., “Why secret detection tools are not enough: An industrial case study,”Empirical Software Engineering, vol. 27, no. 3, pp. 1–29, 2022

2022

[13] [13]

A survey of machine learning for big code and naturalness,

M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey of machine learning for big code and naturalness,”ACM Computing Surveys, vol. 51, no. 4, pp. 1–37, 2018

2018

[14] [14]

The seven sins: Security smells in infrastructure as code scripts,

A. Rahman, C. Parnin, and L. Williams, “The seven sins: Security smells in infrastructure as code scripts,” in2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 164–175, IEEE, 2019

2019

[15] [15]

Hey, your secrets leaked! detecting and characterizing secret leakage in the wild,

J. Zhouet al., “Hey, your secrets leaked! detecting and characterizing secret leakage in the wild,” inProceedings of the IEEE Symposium on Security and Privacy (SP), pp. 449–467, 2025

2025

[16] [16]

Don’t leak your keys: Understanding, measuring, and exploiting the appsecret leaks in mini-programs,

Y . Zhang, Y . Yang, and Z. Lin, “Don’t leak your keys: Understanding, measuring, and exploiting the appsecret leaks in mini-programs,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 2411–2425, 2023

2023

[17] [17]

The skeleton keys: A large-scale analysis of credential leakage in mini- apps,

Y . Shi, G. Yang, Z. Yang, Y . Yang, M. Yang, K. Zhong, and X. Zhang, “The skeleton keys: A large-scale analysis of credential leakage in mini- apps,” inProceedings of the Network and Distributed System Security Symposium (NDSS), (San Diego, CA, USA), Internet Society, 2025

2025

[18] [18]

Secrets revealed in container images: an internet-wide study on occurrence and impact,

M. Dahlmanns, C. Sander, R. Decker, and K. Wehrle, “Secrets revealed in container images: an internet-wide study on occurrence and impact,” inProceedings of the 2023 ACM Asia Conference on Computer and Communications Security, pp. 797–811, 2023

2023

[19] [19]

Leaky apps: Large-scale analysis of secrets distributed in android and ios apps,

D. Schmidt, S. Schrittwieser, and E. Weippl, “Leaky apps: Large-scale analysis of secrets distributed in android and ios apps,” inProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pp. 2459–2473, 2025

2025

[20] [20]

Secrets in source code: Reducing false positives using machine learning,

A. Sahaet al., “Secrets in source code: Reducing false positives using machine learning,” inProceedings of COMSNETS, pp. 168–175, 2020

2020

[21] [21]

Why don’t software developers use static analysis tools to find bugs?,

B. Johnson, Y . Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?,” in2013 35th International Conference on Software Engineering (ICSE), pp. 672–681, IEEE, 2013

2013

[22] [22]

Evaluating static analysis defect warnings on production software,

N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, and Y . Zhou, “Evaluating static analysis defect warnings on production software,” in Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pp. 1–8, 2007

2007

[23] [23]

Using static analysis to find bugs,

N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, “Using static analysis to find bugs,”IEEE software, vol. 25, no. 5, pp. 22–29, 2008

2008

[24] [24]

The google findbugs fixit,

N. Ayewah and W. Pugh, “The google findbugs fixit,” inProceedings of the 19th international symposium on Software testing and analysis, pp. 241–252, 2010

2010

[25] [25]

Why secret detection tools are not enough: It’s not just about false positives-an industrial case study,

M. R. Rahman, N. Imtiaz, M.-A. Storey, and L. Williams, “Why secret detection tools are not enough: It’s not just about false positives-an industrial case study,”Empirical Software Engineering, vol. 27, no. 3, p. 59, 2022

2022

[26] S. K. Basak, K. V. English, K. Ogura, V. Kambara, B. Reaves, and L. Williams, “AssetHarvester: A static analysis tool for detecting secret-asset pairs in software artifacts,” arXiv preprint arXiv:2403.19072, 2024.

[27] X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection: Emerging results and future directions,” in Proceedings of the 46th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 2024.

[28] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.

[29] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “CodeSearchNet challenge: Evaluating the state of semantic code search,” arXiv preprint arXiv:1909.09436, 2019.

[30] D. Nguyen, L. Nam, A. Dau, A. Nguyen, K. Nghiem, J. Guo, and N. Bui, “The Vault: A comprehensive multilingual dataset for advancing code understanding and generation,” in Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4763–4788, 2023.

[31] Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages,” in Findings of EMNLP, 2020.

[32] D. Guo, S. Ren, S. Lu, et al., “GraphCodeBERT: Pre-training code representations with data flow,” in Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[33] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2668, 2021.

[34] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” arXiv preprint arXiv:2203.03850, 2022.

[35] PyGithub Contributors, “PyGithub: Typed interactions with the GitHub API v3.” https://github.com/PyGithub/PyGithub, Feb. 2026.

[36] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining GitHub,” in Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 92–101, 2014.

[37] V. Cosentino, J. L. C. Izquierdo, and J. Cabot, “A systematic mapping study of software development with GitHub,” IEEE Access, vol. 5, pp. 7173–7192, 2017.

[38] R. Artstein and M. Poesio, “Inter-coder agreement for computational linguistics,” Computational Linguistics, vol. 34, no. 4, pp. 555–596, 2008.

[39] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in Neural Information Processing Systems, vol. 28, 2015.

[40] T.-Y. Lin et al., “Focal loss for dense object detection,” in Proceedings of ICCV, pp. 2980–2988, 2017.