
arxiv: 2605.20368 · v1 · pith:O6IIZUHZnew · submitted 2026-05-19 · 💻 cs.CR · cs.AI

Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

Pith reviewed 2026-05-21 07:24 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords security document classification · local large language model · fine-tuning · sensitive information · open-source system · data privacy · Qwen model

The pith

A fine-tuned local large language model classifies security documents at 95 percent accuracy while keeping all processing under local control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TorchSight, an open-source local system built around a fine-tuned Qwen 3.5 27B model for classifying documents into security categories. The model trains on 78,358 samples from 13 permissively licensed sources plus GPT-4 synthetic data that cover seven main categories and 51 subcategories. On a main test set of 1,000 documents it reaches 95.0 percent category-level accuracy and 93.8 percent on a separate external set of 500 samples, exceeding the scores of commercial models tested under identical prompting. Organizations that scan for sensitive information can therefore use the system without transmitting documents to external infrastructure. A sympathetic reader would care because the approach directly addresses the tension between needing context-aware classification and maintaining strict local control over confidential material.

Core claim

The central claim is that fine-tuning a local large language model on a dataset of 78,358 samples drawn from 13 permissively licensed sources together with GPT-4 synthetic data enables accurate classification of security documents into seven categories and 51 subcategories. The resulting system attains 95.0 percent category-level accuracy on a benchmark of 1,000 documents with a 95 percent confidence interval of 93.5 to 96.2 percent, while commercial models score between 75.4 and 79.9 percent under the same protocol. On an external held-out set of 500 samples the model reaches 93.8 percent accuracy. This performance demonstrates that accurate, context-sensitive classification can occur while document processing stays under local control.

What carries the argument

The fine-tuned Qwen 3.5 27B model trained on 78,358 samples covering seven security categories and 51 subcategories, which performs the entire classification task on local hardware without external data transmission.
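
To make "performs the entire classification task on local hardware" concrete, here is a minimal sketch of a classification call that only ever talks to localhost. It assumes the model is served through Ollama's HTTP API on its default port; this summary does not state how TorchSight actually invokes the model. The model tag, the prompt wording, and the category names are all hypothetical placeholders, since the seven category names are not listed here.

```python
import json
import urllib.request

# Hypothetical placeholders: the paper defines seven categories, but their
# names are not given in this summary.
CATEGORIES = [
    "category_1", "category_2", "category_3", "category_4",
    "category_5", "category_6", "category_7",
]

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "local-security-classifier"  # hypothetical model tag


def build_request(document_text: str) -> dict:
    """Build a non-streaming generate request for a locally served model."""
    prompt = (
        "Classify the document into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ".\nAnswer with the category name only.\n\nDocument:\n"
        + document_text
    )
    return {"model": MODEL_TAG, "prompt": prompt, "stream": False}


def parse_category(model_output: str) -> str:
    """Map raw model text to a known category, or 'unknown' if none matches."""
    cleaned = model_output.strip().lower()
    for name in CATEGORIES:
        if name in cleaned:
            return name
    return "unknown"


def classify_locally(document_text: str) -> str:
    """Send the document to the localhost endpoint and return the parsed category."""
    body = json.dumps(build_request(document_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return parse_category(payload["response"])
```

Keeping request construction and output parsing separate from the network call lets both be checked without a running model server; `classify_locally` itself requires one.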

If this is right

  • Organizations can classify sensitive documents accurately without sending data to cloud services.
  • Document processing remains under local control, reducing exposure risks during scanning.
  • The approach outperforms commercial models when both use the same prompting protocol.
  • Performance holds at 93.8 percent on a separate external validation set of 500 samples.
  • An open-source implementation and benchmark dataset become available for further local security tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning pattern could extend to context-aware classification of sensitive material in regulated fields such as healthcare records or financial filings.
  • Smaller distilled versions of the model might enable deployment on standard office hardware while preserving most of the accuracy.
  • Integration with local file systems could allow automated flagging of documents before they enter shared repositories.

Load-bearing premise

The 78,358 samples from 13 permissively licensed sources combined with GPT-4 synthetic data sufficiently represent the diversity, context, and boundary cases of real-world security documents across the seven categories and 51 subcategories.

What would settle it

Testing the released model on a new collection of real security documents drawn from an organization or time period absent from the training sources and measuring whether category-level accuracy falls substantially below 90 percent.
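
The test described above could be scored roughly as follows. This is a sketch under stated assumptions: labels are assumed to be strings of the form `category/subcategory` (a hypothetical encoding, not one confirmed by the paper), and category-level accuracy is taken to mean agreement on the top-level category only. The 0.90 threshold comes from the sentence above.

```python
from typing import Sequence


def top_level(label: str) -> str:
    """Return the top-level category of a 'category/subcategory' label."""
    return label.split("/", 1)[0].strip().lower()


def category_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of documents whose predicted top-level category matches the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(top_level(g) == top_level(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


def falls_below(gold: Sequence[str], predicted: Sequence[str],
                threshold: float = 0.90) -> bool:
    """True when category-level accuracy on the new collection is under the threshold."""
    return category_accuracy(gold, predicted) < threshold
```

A point estimate alone does not show that a drop is substantial; on a small collection it would be worth pairing this with an interval estimate before drawing conclusions.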

Figures

Figures reproduced from arXiv: 2605.20368 by Ivan Dobrovolskyi.

Figure 5: Accuracy of selected models on the primary (Eval-1000) and external (Eval-500) benchmarks. Source: created by the author based on comparative results from the primary and external benchmarks.
Taken together, the primary and external benchmark results suggest that Beam q4_K_M’s advantage is not confined to the synthetic evaluation setting. Its performance remains higher than the tested commercial baselines …
Abstract

Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TorchSight, an open-source local system for security document classification using a fine-tuned Qwen 3.5 27B model. The system is trained on 78,358 samples drawn from 13 permissively licensed sources augmented with GPT-4 synthetic data spanning seven security categories and 51 subcategories. On a 1,000-document internal benchmark the model achieves 95.0% category-level accuracy (95% CI 93.5-96.2), outperforming prompted commercial models (75.4-79.9%) under identical prompting; on a separate 500-sample external held-out set it reaches 93.8% accuracy.

Significance. If the reported accuracies and generalization hold, the work demonstrates that a fine-tuned local LLM can deliver high-accuracy security document classification while preserving data locality, offering a practical alternative to cloud services. The release of benchmark data and open-source code supports reproducibility and further community evaluation in privacy-sensitive domains.

major comments (2)
  1. [Section 3] Section 3 (Data Curation): The construction of the 78,358-sample training set from 13 sources plus GPT-4 synthetic data is presented without quantitative coverage analysis (e.g., per-subcategory sample counts, representation of mixed-category or novel-format documents, or explicit handling of boundary cases). Because the central claim of reliable 93.8% performance on the external set rests on the assumption that this distribution matches operational security documents, the absence of such diagnostics weakens the generalization argument.
  2. [Section 5.2] Section 5.2 (External Validation): The external set of 500 samples is described as held-out, yet the paper does not report whether any of these samples were used in GPT-4 synthetic data generation or share lexical/contextual overlap with the training distribution. This detail is load-bearing for interpreting the 93.8% accuracy as evidence of true out-of-distribution robustness rather than partial memorization.
minor comments (2)
  1. [Abstract] Abstract and Section 1: The caveat that 'the margin depends on dataset composition and difficult boundary cases' is stated but not quantified; adding a short limitations paragraph with concrete examples of failure modes would improve clarity without altering the main results.
  2. [Table 1] Table 1 or equivalent results table: Ensure confidence intervals are reported consistently for all compared models and that the prompting templates used for commercial baselines are reproduced verbatim in an appendix.
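
One way to report intervals consistently across models is to apply the same binomial interval to every accuracy figure. The sketch below uses the Wilson score interval; this summary does not say which method the paper used, so that choice is an assumption. For 950 correct out of 1,000 it gives roughly 93.5 to 96.2 percent, which agrees with the reported interval.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Counts below are derived from the percentages quoted in this summary,
# assuming they correspond to whole-document counts (an assumption).
results = {
    "fine-tuned model, Eval-1000": (950, 1000),
    "commercial baseline (low), Eval-1000": (754, 1000),
    "commercial baseline (high), Eval-1000": (799, 1000),
    "fine-tuned model, Eval-500": (469, 500),
}

for name, (correct, total) in results.items():
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.1%} ({low:.1%} to {high:.1%})")
```

The external set has half as many samples, so its interval is wider than the one for the primary benchmark at a similar accuracy.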

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve the manuscript.

point-by-point responses
  1. Referee: [Section 3] Section 3 (Data Curation): The construction of the 78,358-sample training set from 13 sources plus GPT-4 synthetic data is presented without quantitative coverage analysis (e.g., per-subcategory sample counts, representation of mixed-category or novel-format documents, or explicit handling of boundary cases). Because the central claim of reliable 93.8% performance on the external set rests on the assumption that this distribution matches operational security documents, the absence of such diagnostics weakens the generalization argument.

    Authors: We agree that additional quantitative diagnostics in Section 3 would strengthen the generalization argument. In the revised manuscript we will add per-subcategory sample counts, a brief discussion of mixed-category and novel-format documents, and our curation approach to boundary cases. revision: yes

  2. Referee: [Section 5.2] Section 5.2 (External Validation): The external set of 500 samples is described as held-out, yet the paper does not report whether any of these samples were used in GPT-4 synthetic data generation or share lexical/contextual overlap with the training distribution. This detail is load-bearing for interpreting the 93.8% accuracy as evidence of true out-of-distribution robustness rather than partial memorization.

    Authors: The external set was drawn from sources disjoint from the 13 training sources and was not used to generate the GPT-4 synthetic data. We will revise Section 5.2 to explicitly describe this source-level separation and confirm that no samples from the external set entered the synthetic augmentation pipeline, thereby supporting the out-of-distribution interpretation. revision: yes
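
Source-level separation does not by itself rule out lexical overlap, which the referee also raised. A simple check is sketched below using Jaccard similarity over word n-grams. The function names, the n-gram size, and the 0.5 threshold are illustrative assumptions, not anything the paper reports.

```python
from typing import List, Sequence, Set, Tuple


def word_ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_overlaps(external: Sequence[str], training: Sequence[str],
                  n: int = 5, threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Return (index, best similarity) for external documents that closely match a training document."""
    training_grams = [word_ngrams(doc, n) for doc in training]
    flagged = []
    for index, doc in enumerate(external):
        grams = word_ngrams(doc, n)
        best = max((jaccard(grams, tg) for tg in training_grams), default=0.0)
        if best >= threshold:
            flagged.append((index, best))
    return flagged
```

This pairwise comparison is quadratic in the number of documents; against a training set of 78,358 samples an approximate method such as MinHash would likely be needed, but the brute-force version is enough to show what is being measured.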

Circularity Check

0 steps flagged

No circularity in empirical accuracy reporting on held-out data

full rationale

The paper's central results consist of measured classification accuracies (95.0% on 1,000 documents, 93.8% on 500 external samples) obtained by evaluating a fine-tuned model on explicitly held-out and separate validation sets. These are direct empirical observations from standard train/test splits and external evaluation, not quantities that reduce by construction to the training inputs, fitted parameters, or self-referential predictions. No equations, uniqueness theorems, or ansatzes are presented; the training data sources (13 licensed corpora plus GPT-4 synthetic) are described as inputs to model fitting, while test performance is reported as an independent measurement. No self-citations appear in a load-bearing role for the accuracy claims. The representativeness concern raised in the skeptic note is a question of external validity and dataset coverage, not a circularity in the derivation chain itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the quality and representativeness of the mixed real-plus-synthetic training data and on standard but unspecified fine-tuning choices.

free parameters (1)
  • Fine-tuning hyperparameters
    Learning rate, batch size, number of epochs and other training settings are chosen to reach the reported accuracy; these are fitted or selected during development.
axioms (1)
  • domain assumption: GPT-4 synthetic data accurately supplements real samples for the seven security categories without introducing systematic label noise.
    The paper relies on this to reach 78k training samples covering 51 subcategories.

pith-pipeline@v0.9.0 · 5728 in / 1362 out tokens · 62954 ms · 2026-05-21T07:24:19.126902+00:00 · methodology


