Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System
Pith reviewed 2026-05-21 07:24 UTC · model grok-4.3
The pith
A fine-tuned local large language model classifies security documents at 95 percent accuracy while keeping all processing under local control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that fine-tuning a local large language model on a dataset of 78,358 samples drawn from 13 permissively licensed sources together with GPT-4 synthetic data enables accurate classification of security documents into seven categories and 51 subcategories. The resulting system attains 95.0 percent category-level accuracy on a benchmark of 1,000 documents with a 95 percent confidence interval of 93.5 to 96.2 percent, while commercial models score between 75.4 and 79.9 percent under the same protocol. On an external held-out set of 500 samples the model reaches 93.8 percent accuracy. This performance demonstrates that accurate, context-sensitive classification can occur while document processing stays under local control.
What carries the argument
The fine-tuned Qwen 3.5 27B model trained on 78,358 samples covering seven security categories and 51 subcategories, which performs the entire classification task on local hardware without external data transmission.
If this is right
- Organizations can classify sensitive documents accurately without sending data to cloud services.
- Document processing remains under local control, reducing exposure risks during scanning.
- The approach outperforms commercial models when both use the same prompting protocol.
- Performance holds at 93.8 percent on a separate external validation set of 500 samples.
- An open-source implementation and benchmark dataset become available for further local security tools.
Where Pith is reading between the lines
- The same fine-tuning pattern could extend to context-aware classification of sensitive material in regulated fields such as healthcare records or financial filings.
- Smaller distilled versions of the model might enable deployment on standard office hardware while preserving most of the accuracy.
- Integration with local file systems could allow automated flagging of documents before they enter shared repositories.
Load-bearing premise
The 78,358 samples from 13 permissively licensed sources combined with GPT-4 synthetic data sufficiently represent the diversity, context, and boundary cases of real-world security documents across the seven categories and 51 subcategories.
What would settle it
Testing the released model on a new collection of real security documents drawn from an organization or time period absent from the training sources and measuring whether category-level accuracy falls substantially below 90 percent.
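One way to make "falls substantially below 90 percent" operational is a one-sided exact binomial test against a 0.90 accuracy threshold. The sketch below is a hedged illustration, not the paper's protocol: the function names, the 0.05 significance level, and the decision rule itself are assumptions introduced here.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def accuracy_below_threshold(gold, pred, threshold=0.90, alpha=0.05):
    """Test whether observed accuracy is significantly below `threshold`.

    gold, pred: equal-length sequences of category labels.
    threshold and alpha are illustrative choices, not values from the paper.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    correct = sum(g == p for g, p in zip(gold, pred))
    # Probability of seeing this few correct answers if true accuracy were `threshold`.
    p_value = binom_cdf(correct, n, threshold)
    return {"accuracy": correct / n, "p_value": p_value, "below": p_value < alpha}
```

Calling `accuracy_below_threshold(gold_labels, model_predictions)` on a freshly collected, independently labeled set returns the observed accuracy together with a flag that is true only when the shortfall is unlikely to be sampling noise.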
Original abstract
Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TorchSight, an open-source local system for security document classification using a fine-tuned Qwen 3.5 27B model. The system is trained on 78,358 samples drawn from 13 permissively licensed sources augmented with GPT-4 synthetic data spanning seven security categories and 51 subcategories. On a 1,000-document internal benchmark the model achieves 95.0% category-level accuracy (95% CI 93.5-96.2), outperforming prompted commercial models (75.4-79.9%) under identical prompting; on a separate 500-sample external held-out set it reaches 93.8% accuracy.
Significance. If the reported accuracies and generalization hold, the work demonstrates that a fine-tuned local LLM can deliver high-accuracy security document classification while preserving data locality, offering a practical alternative to cloud services. The release of benchmark data and open-source code supports reproducibility and further community evaluation in privacy-sensitive domains.
major comments (2)
- [Section 3] Section 3 (Data Curation): The construction of the 78,358-sample training set from 13 sources plus GPT-4 synthetic data is presented without quantitative coverage analysis (e.g., per-subcategory sample counts, representation of mixed-category or novel-format documents, or explicit handling of boundary cases). Because the central claim of reliable 93.8% performance on the external set rests on the assumption that this distribution matches operational security documents, the absence of such diagnostics weakens the generalization argument.
- [Section 5.2] Section 5.2 (External Validation): The external set of 500 samples is described as held-out, yet the paper does not report whether any of these samples were used in GPT-4 synthetic data generation or share lexical/contextual overlap with the training distribution. This detail is load-bearing for interpreting the 93.8% accuracy as evidence of true out-of-distribution robustness rather than partial memorization.
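The lexical-overlap question in the Section 5.2 comment can be probed with a simple word n-gram containment check between external samples and training texts. The following is a minimal sketch under stated assumptions: the function names, the 5-gram window, and the 0.5 flagging threshold are illustrative choices, not anything reported by the authors.

```python
import re


def ngrams(text: str, n: int = 5) -> set:
    """Set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_overlaps(external, training, n=5, threshold=0.5):
    """Return (index, score) for external samples that overlap a training text.

    score = largest fraction of the sample's n-grams found in any single
    training text. `threshold` is an arbitrary illustrative cut-off.
    """
    training_grams = [ngrams(t, n) for t in training]
    flagged = []
    for idx, sample in enumerate(external):
        grams = ngrams(sample, n)
        if not grams:
            continue  # sample shorter than n tokens: nothing to compare
        score = max(
            (len(grams & tg) / len(grams) for tg in training_grams),
            default=0.0,
        )
        if score >= threshold:
            flagged.append((idx, score))
    return flagged
```

This pairwise comparison is quadratic in the number of documents, so for a 78,358-sample training set an inverted index or MinHash would be the more practical choice. The sketch only shows what the diagnostic measures.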
minor comments (2)
- [Abstract] Abstract and Section 1: The caveat that 'the margin depends on dataset composition and difficult boundary cases' is stated but not quantified; adding a short limitations paragraph with concrete examples of failure modes would improve clarity without altering the main results.
- [Table 1] Table 1 or equivalent results table: Ensure confidence intervals are reported consistently for all compared models and that the prompting templates used for commercial baselines are reproduced verbatim in an appendix.
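Reporting intervals consistently for every compared model is a small computation. The sketch below uses the Wilson score interval. Whether the authors used this method is not stated in the material quoted here, so that is an assumption, as are the correct-answer counts, which are back-computed from the reported percentages on 1,000 documents. For 950 of 1,000 correct, the Wilson interval rounds to 93.5-96.2 percent, which agrees with the reported figure.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Counts are assumptions derived from the reported accuracies (n = 1,000).
results = {
    "fine-tuned local model": (950, 1000),
    "commercial, lowest reported": (754, 1000),
    "commercial, highest reported": (799, 1000),
}

for name, (k, n) in results.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

The same function applied to the 500-sample external set would give a wider interval, which is worth showing alongside the 93.8 percent point estimate.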
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve the manuscript.
point-by-point responses
- Referee: [Section 3] Section 3 (Data Curation): The construction of the 78,358-sample training set from 13 sources plus GPT-4 synthetic data is presented without quantitative coverage analysis (e.g., per-subcategory sample counts, representation of mixed-category or novel-format documents, or explicit handling of boundary cases). Because the central claim of reliable 93.8% performance on the external set rests on the assumption that this distribution matches operational security documents, the absence of such diagnostics weakens the generalization argument.
  Authors: We agree that additional quantitative diagnostics in Section 3 would strengthen the generalization argument. In the revised manuscript we will add per-subcategory sample counts, a brief discussion of mixed-category and novel-format documents, and our curation approach to boundary cases. revision: yes
- Referee: [Section 5.2] Section 5.2 (External Validation): The external set of 500 samples is described as held-out, yet the paper does not report whether any of these samples were used in GPT-4 synthetic data generation or share lexical/contextual overlap with the training distribution. This detail is load-bearing for interpreting the 93.8% accuracy as evidence of true out-of-distribution robustness rather than partial memorization.
  Authors: The external set was drawn from sources disjoint from the 13 training sources and was not used to generate the GPT-4 synthetic data. We will revise Section 5.2 to explicitly describe this source-level separation and confirm that no samples from the external set entered the synthetic augmentation pipeline, thereby supporting the out-of-distribution interpretation. revision: yes
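The per-subcategory sample counts promised in the first response amount to a short tabulation over the training labels. The sketch below is illustrative only: the `(category, subcategory)` label format, the example label names, and the `min_count` cut-off of 50 are assumptions, not details of the released dataset.

```python
from collections import Counter


def coverage_report(labels, min_count=50):
    """Summarize how many training samples each subcategory has.

    labels: iterable of (category, subcategory) pairs (hypothetical format).
    Returns (rows, sparse) where rows are (category, subcategory, count, share)
    sorted from rarest to most common, and sparse lists the pairs with fewer
    than `min_count` samples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    rows = [
        (cat, sub, c, c / total)
        for (cat, sub), c in sorted(counts.items(), key=lambda kv: kv[1])
    ]
    sparse = [(cat, sub) for cat, sub, c, _ in rows if c < min_count]
    return rows, sparse


# Toy input with made-up label names, only to show the output shape.
toy = [("credentials", "api_key")] * 3 + [("pii", "email")] * 1
rows, sparse = coverage_report(toy, min_count=2)
for cat, sub, c, share in rows:
    print(f"{cat}/{sub}: {c} ({share:.0%})")
```

Listing the rarest subcategories first makes thin coverage visible immediately, which is the diagnostic the referee asks for.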
Circularity Check
No circularity in empirical accuracy reporting on held-out data
full rationale
The paper's central results consist of measured classification accuracies (95.0% on 1,000 documents, 93.8% on 500 external samples) obtained by evaluating a fine-tuned model on explicitly held-out and separate validation sets. These are direct empirical observations from standard train/test splits and external evaluation, not quantities that reduce by construction to the training inputs, fitted parameters, or self-referential predictions. No equations, uniqueness theorems, or ansatzes are presented; the training data sources (13 licensed corpora plus GPT-4 synthetic) are described as inputs to model fitting, while test performance is reported as an independent measurement. No self-citations appear in a load-bearing role for the accuracy claims. The representativeness concern raised in the skeptic note is a question of external validity and dataset coverage, not a circularity in the derivation chain itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- Fine-tuning hyperparameters
axioms (1)
- domain assumption: GPT-4 synthetic data accurately supplements real samples for the seven security categories without introducing systematic label noise.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model... trained on 78,358 samples... seven security categories and 51 subcategories"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.