
arXiv: 2511.20944 · v4 · submitted 2025-11-26 · 💻 cs.LG · cs.CR

Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Pith reviewed 2026-05-17 04:06 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords Business Email Compromise · DistilBERT · CatBoost · cost-sensitive detection · psycholinguistics · adversarial examples · email fraud · transformer models

The pith

DistilBERT reaches perfect AUC for business email compromise detection while CatBoost delivers faster CPU performance under cost-sensitive rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares a semantic transformer model using DistilBERT against a psycholinguistic model using CatBoost for identifying business email compromise attacks. Both are tested on legitimate corporate emails mixed with AI-generated adversarial examples that cover 30 attack taxonomies and include Unicode obfuscation tricks. DistilBERT attains an AUC of 1.0000 and F1 of 0.9981 at 7.403 milliseconds per email on GPU, while CatBoost reaches an AUC of 0.9860 and F1 of 0.9382 at 0.855 milliseconds on CPU. The authors add a three-way decision policy that routes messages to auto-allow, auto-block, or manual review to minimize expected financial loss given the extreme cost asymmetry between missed scams and false alarms. A sympathetic reader cares because BEC attacks cause billions in annual losses and current detection systems often fail against sophisticated impersonation.

Core claim

DistilBERT achieves AUC = 1.0000 and F1 = 0.9981 at 7.403 ms per email on GPU while CatBoost achieves AUC = 0.9860 and F1 = 0.9382 at 0.855 ms on CPU when tested on a hybrid dataset of legitimate corporate email and AI-synthesised adversarial fraud across 30 BEC taxonomies. A three-way cost-sensitive decision policy optimises expected financial loss under a 1:5,167 false-negative-to-false-positive cost ratio.

What carries the argument

The three-way cost-sensitive decision policy that routes each email to auto-allow, auto-block or manual review to minimise expected loss at the stated cost ratio.
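
One way to read the policy is as a per-email comparison of three expected losses. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the `Costs` dataclass, the `route` function, and the review cost of 0.2 are hypothetical, the cost ratio is read as one false negative costing 5,167 times one false positive, and manual review is assumed to always reach the correct verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Costs in units of one false positive (all values are assumptions)."""
    false_negative: float = 5167.0  # ratio reported in the paper, read as FN = 5,167 x FP
    false_positive: float = 1.0
    review: float = 0.2             # hypothetical cost of one manual review


def route(p_fraud: float, costs: Costs = Costs()) -> str:
    """Return the action with the lowest expected loss for one email."""
    if not 0.0 <= p_fraud <= 1.0:
        raise ValueError("p_fraud must be a probability in [0, 1]")
    expected_loss = {
        # Allowing an email only costs something if it is fraud.
        "auto-allow": p_fraud * costs.false_negative,
        # Review is assumed to resolve the case correctly at a flat cost.
        "manual-review": costs.review,
        # Blocking an email only costs something if it is legitimate.
        "auto-block": (1.0 - p_fraud) * costs.false_positive,
    }
    return min(expected_loss, key=expected_loss.get)


if __name__ == "__main__":
    for p in (0.0, 0.00001, 0.5, 0.9, 1.0):
        print(f"p_fraud={p:<8} -> {route(p)}")
```

With these illustrative numbers the review band is wide: auto-allow requires a fraud probability below roughly 0.2/5167 ≈ 3.9e-5 and auto-block requires one above 0.8, so the routing depends heavily on how well calibrated the model scores are.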

If this is right

  • High-accuracy semantic models can nearly eliminate missed BEC scams when GPU resources are available.
  • Faster psycholinguistic models enable practical deployment on standard CPU hardware with acceptable detection rates.
  • The three-way policy lets organizations tune the trade-off between financial loss from false negatives and operational cost from false positives.
  • Classical baselines such as TF-IDF with logistic regression and character n-grams with linear SVM serve as lower-performing controls.
  • Ablation of the Smiling Assassin Score and homoglyph sensitivity analysis confirm the contribution of linguistic and structural features.
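
To make the homoglyph point concrete, here is a small Python sketch of folding look-alike characters back to ASCII and counting how many were found. The paper's actual homoglyph map is not shown on this page, so `HOMOGLYPHS` is a hypothetical six-entry map chosen for illustration (it is not the Unicode TS #39 confusables data), and `fold_homoglyphs` is an illustrative helper name.

```python
from typing import Dict, Tuple

# Hypothetical, deliberately tiny map from Cyrillic look-alikes to Latin letters.
HOMOGLYPHS: Dict[str, str] = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def fold_homoglyphs(text: str, mapping: Dict[str, str] = HOMOGLYPHS) -> Tuple[str, int]:
    """Replace mapped look-alike characters and count the substitutions."""
    folded = "".join(mapping.get(ch, ch) for ch in text)
    hits = sum(1 for ch in text if ch in mapping)
    return folded, hits


if __name__ == "__main__":
    spoofed = "p\u0430yp\u0430l.com"  # two Cyrillic letters hidden in the domain
    print(fold_homoglyphs(spoofed))
```

A sensitivity analysis in this spirit would rerun the detector with smaller or larger maps and track how the metrics move; the substitution count itself can serve as a structural feature.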

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production systems would need testing against documented real-world BEC cases to confirm the synthetic data results transfer.
  • A combined system could use CatBoost for initial fast screening and DistilBERT for high-risk cases.
  • The cost-sensitive routing approach could extend to other asymmetric-threat domains such as account takeover or invoice fraud.
  • Scaling the policy to enterprise email volumes would require measuring actual review costs and loss amounts in live operations.
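
The combined system suggested in the list above could be sketched as a two-stage cascade. This is an editorial illustration only: `cascade_score`, the `fast` and `slow` callables, and the 0.05/0.95 band are all hypothetical, and the paper does not evaluate such a cascade. The callables stand in for a CatBoost-style CPU scorer and a DistilBERT-style GPU scorer.

```python
from typing import Callable, Tuple

Scorer = Callable[[str], float]  # maps an email body to a fraud probability


def cascade_score(
    email: str,
    fast: Scorer,
    slow: Scorer,
    low: float = 0.05,   # hypothetical: at or below this the fast score is trusted
    high: float = 0.95,  # hypothetical: at or above this the fast score is trusted
) -> Tuple[float, str]:
    """Score with the cheap model first and escalate only uncertain emails."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    p_fast = fast(email)
    if p_fast <= low or p_fast >= high:
        return p_fast, "fast"
    # The fast model is unsure, so pay the latency of the slower model.
    return slow(email), "slow"


if __name__ == "__main__":
    # Stub scorers used only to show the control flow.
    demo_fast = lambda text: 0.5 if "wire" in text else 0.01
    demo_slow = lambda text: 0.99
    print(cascade_score("lunch at noon?", demo_fast, demo_slow))
    print(cascade_score("urgent wire transfer", demo_fast, demo_slow))
```

Under the extreme cost asymmetry discussed in the paper, the lower bound of the band would likely need to sit far below 0.05; the values here only show the mechanism.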

Load-bearing premise

The AI-synthesised adversarial fraud generated across 30 BEC taxonomies is sufficiently representative of real-world BEC attempts to support the reported performance claims and cost optimization.

What would settle it

Evaluating the same models on a dataset of verified real-world BEC emails from actual incidents would reveal whether the reported AUC and F1 scores hold or drop substantially.

Figures

Figures reproduced from arXiv: 2511.20944 by Frederick Ayivor (Independent Researcher, Fishers, Indiana, USA) and Yaw Osei Adjei (Kwame Nkrumah University of Science and Technology, Kumasi, Ghana).

Figure 1: Comparative Architecture. Stream A (Forensic/Green) extracts psycholinguistic features for a Gradient Boosting…
Figure 2: Latency Distribution (1,586 samples). CatBoost demon…
Figure 3: Learning Curve (CatBoost). Validation AUC plateaus…
Figure 4: Adversarial Robustness Analysis and Error Distribution. Top: Recall on Clean vs. Poisoned data (Left) and Degradation…
Figure 7: SHAP Feature Importance. Money entities (ent money count) and threat keywords (tech threat count) dominate the Forensic decision logic. Blue regions indicate features that push predictions toward legitimate classification, while red regions indicate fraud signals. The wide distribution of certain features (e.g., complex word ratio) shows they have variable impacts depending on context.
Figure 8: DistilBERT Token Impact. High attention weights…
Figure 9: Feature Starvation Analysis. The False Negative…
Figure 12: ROI Sensitivity Analysis. Defense effectiveness re…
Figure 10: Reliability Diagram. The model is highly calibrated…
Original abstract

Business Email Compromise (BEC) is a high-impact social engineering threat with extreme operational asymmetry: false negatives can trigger large financial losses, while false positives primarily incur investigation and delay costs. This paper compares two BEC detection paradigms under a cost-sensitive decision framework: (i) a semantic transformer approach (DistilBERT) for contextual language understanding, and (ii) a forensic psycholinguistic approach (CatBoost) using engineered linguistic and structural cues. We evaluate both on a hybrid dataset (N = 7,990) combining legitimate corporate email and AI-synthesised adversarial fraud generated across 30 BEC taxonomies, including character-level Unicode obfuscations. We add classical baselines (TF-IDF+LogReg and character n-gram+Linear SVM), an ablation study for the Smiling Assassin Score, and a homoglyph-map sensitivity analysis. DistilBERT achieves AUC = 1.0000 and F1 = 0.9981 at 7.403 ms per email on GPU; CatBoost achieves AUC = 0.9860 and F1 = 0.9382 at 0.855 ms on CPU. A three-way cost-sensitive decision policy (auto-allow, auto-block, manual review) optimises expected financial loss under a 1:5,167 false-negative-to-false-positive cost ratio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript compares a semantic transformer approach using DistilBERT with a forensic psycholinguistic approach using CatBoost for detecting Business Email Compromise (BEC) attacks. It evaluates these models on a hybrid dataset of 7,990 emails, consisting of legitimate corporate emails and AI-synthesized adversarial fraud generated across 30 BEC taxonomies with character-level Unicode obfuscations. The paper reports that DistilBERT achieves an AUC of 1.0000 and F1 of 0.9981 at 7.403 ms per email on GPU, while CatBoost achieves AUC of 0.9860 and F1 of 0.9382 at 0.855 ms on CPU. It also presents classical baselines, an ablation study for the Smiling Assassin Score, a homoglyph-map sensitivity analysis, and a three-way cost-sensitive decision policy that optimizes expected financial loss under a 1:5,167 false-negative-to-false-positive cost ratio.

Significance. If the synthetic adversarial samples prove representative of real-world BEC attempts, this work would offer a valuable comparison between deep semantic understanding and efficient forensic feature engineering in high-stakes fraud detection. The inclusion of a cost-sensitive policy tailored to the extreme asymmetry of BEC risks (where false negatives lead to significant financial losses) could provide actionable insights for deploying detection systems that balance accuracy, speed, and operational costs. The ablation and sensitivity analyses add to the robustness of the findings.

major comments (2)
  1. The headline performance claims (DistilBERT AUC = 1.0000, F1 = 0.9981; CatBoost AUC = 0.9860, F1 = 0.9382) and the three-way cost-sensitive decision policy are obtained exclusively on a hybrid dataset whose positive class consists of AI-generated adversarial fraud spanning 30 taxonomies plus Unicode homoglyphs. The manuscript reports no external validation against confirmed real-world BEC corpora, no expert rating of sample realism, and no comparison of linguistic statistics between synthetic and incident data. This assumption is load-bearing for the deployment conclusions drawn from the cost optimization under the 1:5,167 ratio.
  2. The abstract states concrete AUC and F1 numbers together with runtime and cost-ratio details, yet the manuscript lacks a full account of dataset construction, the cross-validation procedure, statistical tests, and ablation outcomes, which leaves the central performance claims only moderately supported.
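
A comparison of linguistic statistics of the kind major comment 1 asks for could begin with simple surface measures such as mean sentence length, type-token ratio, and punctuation rate. The Python sketch below is a rough illustration under stated assumptions: `linguistic_stats` is a hypothetical helper, sentences are split naively on `.`, `!`, and `?`, and tokens are runs of ASCII letters and apostrophes, which is cruder than a real tokenizer.

```python
import re
import string
from typing import Dict


def linguistic_stats(text: str) -> Dict[str, float]:
    """Compute three surface statistics for one email body."""
    # Naive sentence split; abbreviations such as "e.g." will be over-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        # Average number of word tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Share of characters that are ASCII punctuation.
        "punctuation_rate": punctuation / len(text) if text else 0.0,
    }


if __name__ == "__main__":
    print(linguistic_stats("Pay now. Pay the invoice now!"))
```

Running the same function over the synthetic positives and over a set of documented incidents, then comparing the resulting distributions, would give a first indication of how far apart the two sources are.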
minor comments (2)
  1. The description of the Smiling Assassin Score in the ablation study could benefit from a more explicit definition or reference to its formula to aid reproducibility.
  2. Consider adding a table summarizing the runtime and performance metrics across all models (including baselines) for easier comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and thorough review. The comments highlight important considerations regarding dataset composition and methodological transparency. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: The headline performance claims (DistilBERT AUC = 1.0000, F1 = 0.9981; CatBoost AUC = 0.9860, F1 = 0.9382) and the three-way cost-sensitive decision policy are obtained exclusively on a hybrid dataset whose positive class consists of AI-generated adversarial fraud spanning 30 taxonomies plus Unicode homoglyphs. The manuscript reports no external validation against confirmed real-world BEC corpora, no expert rating of sample realism, and no comparison of linguistic statistics between synthetic and incident data. This assumption is load-bearing for the deployment conclusions drawn from the cost optimization under the 1:5,167 ratio.

    Authors: We agree that reliance on synthetic adversarial samples is a substantive limitation for direct deployment claims. The 30 taxonomies and Unicode obfuscations were deliberately constructed to emulate documented real-world BEC tactics drawn from industry reports and public threat intelligence, as real labeled corpora with such adversarial variants are rarely available due to privacy constraints. In the revised manuscript we will add a dedicated Limitations subsection that explicitly states the absence of external validation on confirmed real-world BEC corpora, notes the lack of expert realism ratings, and reports basic linguistic statistics (mean sentence length, type-token ratio, and punctuation patterns) comparing the synthetic positives to a small set of publicly documented BEC examples. We will also revise the cost-optimization discussion to frame the 1:5,167 ratio as an illustrative scenario based on published loss estimates rather than a claim of immediate operational readiness. These changes preserve the comparative value of the semantic versus forensic analysis while making the evidential boundaries clear. revision: partial

  2. Referee: The abstract states concrete AUC and F1 numbers together with runtime and cost-ratio details, yet the manuscript lacks a full account of dataset construction, the cross-validation procedure, statistical tests, and ablation outcomes, which leaves the central performance claims only moderately supported.

    Authors: We accept that the main text would benefit from greater methodological detail. In the revised version we will expand the Dataset Construction section to describe the full pipeline for generating the 7,990-email hybrid corpus, including the 30 BEC taxonomy templates, the character-level homoglyph mapping procedure, and the sampling strategy for legitimate corporate emails. A new Experimental Setup subsection will document the stratified 5-fold cross-validation protocol, hyperparameter search ranges, and the statistical tests performed (paired bootstrap confidence intervals on AUC and F1, plus McNemar tests for model comparisons). The ablation results for the Smiling Assassin Score and the full homoglyph sensitivity tables will be moved from the appendix into the main body with accompanying figures. These additions ensure all headline metrics are accompanied by transparent procedural and statistical support. revision: yes
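
For readers unfamiliar with the McNemar test mentioned in the second response, the Python sketch below shows the exact (binomial) form for comparing two classifiers on the same test emails. It is a generic illustration, not the authors' code; `mcnemar_exact` and its argument names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
) -> Tuple[int, int, float]:
    """Return (b, c, p) for the two-sided exact McNemar test.

    b counts emails model A gets right and model B gets wrong;
    c counts emails model A gets wrong and model B gets right.
    """
    if not len(y_true) == len(pred_a) == len(pred_b):
        raise ValueError("all inputs must have the same length")
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree, so there is no evidence
    # Under the null hypothesis each disagreement favours either model with
    # probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0]
    model_a = [1, 1, 1, 1, 0, 0]  # all correct
    model_b = [0, 0, 0, 0, 0, 0]  # misses every fraud email
    print(mcnemar_exact(labels, model_a, model_b))
```

Only the emails on which the two models disagree enter the test, which is why it suits a paired comparison on a shared hold-out set.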

Circularity Check

0 steps flagged

No circularity: results are direct empirical metrics on explicitly constructed dataset

full rationale

The paper reports AUC, F1, and cost-optimized policy values obtained by training and testing DistilBERT and CatBoost on a hybrid dataset of legitimate emails plus AI-generated BEC samples across 30 taxonomies. No mathematical derivations, self-referential equations, fitted parameters defined in terms of the target metrics, or load-bearing self-citations appear in the provided text. All performance numbers are standard hold-out or cross-validation outputs from the experimental pipeline and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that the synthetic adversarial emails faithfully represent real BEC threats and on the externally chosen 1:5,167 cost ratio; no new entities are postulated.

free parameters (1)
  • false-negative-to-false-positive cost ratio
    The 1:5,167 ratio is supplied as an operational input to the decision policy rather than derived from the data or model.
axioms (1)
  • domain assumption: AI-synthesised adversarial fraud across 30 BEC taxonomies, including Unicode obfuscations, is representative of real attacks.
    Invoked to justify generalization from the hybrid dataset to operational performance.
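
To see how much weight the cost-ratio parameter carries, consider the binary special case with only auto-allow and auto-block. This is the standard cost-sensitive threshold, worked under assumptions that are not stated on this page: $p$ is a calibrated fraud probability, correct decisions cost nothing, and the ratio is read as $C_{FN} = 5167\,C_{FP}$. The paper's own policy is three-way, so this is a simplification for intuition only.

```latex
\begin{align}
L_{\mathrm{allow}}(p) &= p \, C_{FN}, \\
L_{\mathrm{block}}(p) &= (1 - p) \, C_{FP}, \\
L_{\mathrm{block}}(p) < L_{\mathrm{allow}}(p)
  &\iff p > \frac{C_{FP}}{C_{FP} + C_{FN}}
  = \frac{1}{1 + 5167} = \frac{1}{5168} \approx 1.9 \times 10^{-4}.
\end{align}
```

Under this reading, blocking becomes the cheaper action once the fraud probability exceeds about 0.02%, which is why the chosen ratio and the calibration of the scores dominate the policy's behaviour.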




Appendix A: Feature Definitions

Table VII lists the psycholinguistic features used in the analysis. It defines each feature and states what linguistic attribute it captures, including politeness–urgency interaction, authority cues, hedge frequency, sentiment shift, and u...