CyberMaskQA: A Privacy-Aware Benchmark for Evaluating Large Language Models in Cybersecurity Question Answering
Pith reviewed 2026-06-30 12:42 UTC · model grok-4.3
The pith
CyberMaskQA introduces a benchmark that grounds cybersecurity QA in realistic organizational contexts while annotating private entities for controlled disclosure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CyberMaskQA is generated by a pipeline that starts with human-curated base scenarios and expands them through LLM-driven semantic variation, then annotates every instance with precise private-entity labels. The resulting questions embed explicit causal dependencies among assets and privileges inside realistic organizational settings across multiple security domains. This structure supports joint measurement of operational reasoning accuracy and privacy preservation under controlled disclosure conditions.
What carries the argument
The annotation of each instance with precise private entity labels that enable controlled information disclosure during evaluation.
If this is right
- Models can be trained and tested for simultaneous gains in cybersecurity reasoning and reduced leakage of sensitive identifiers.
- Researchers gain a tool to quantify privacy-utility trade-offs under realistic causal dependencies rather than isolated facts.
- The benchmark supports development of context-aware systems suitable for regulated environments where cloud models cannot process raw logs.
- Evaluations can separate performance on masked versus unmasked inputs to isolate the effect of privacy controls.
Where Pith is reading between the lines
- Organizations could use the same annotation scheme to preprocess their own logs before sending queries to external models.
- The causal-dependency structure might transfer to other domains that require both reasoning and selective redaction, such as medical or financial QA.
- Future extensions could test whether models trained on this benchmark generalize to new organizational scenarios not seen during dataset creation.
Load-bearing premise
The pipeline of human-curated base scenarios plus LLM semantic expansion produces instances whose private-entity labels correctly capture real operational contexts and causal structures.
What would settle it
A direct comparison in which models achieve identical accuracy and privacy scores on the released dataset versus a version with randomized or removed causal links and entity annotations.
Figures
read the original abstract
Large language models (LLMs) are increasingly applied to cybersecurity question answering (QA) for critical tasks such as incident response and vulnerability analysis. However, real-world operational contexts, including system logs and network configurations, inherently contain sensitive identifiers, e.g., IP addresses, host names, and user accounts. Processing this data with cloud-based models is often unsafe or infeasible in regulated environments. Furthermore, progress in privacy-preserving QA is hindered by the lack of annotated, context-rich datasets capable of jointly evaluating operational reasoning and privacy preservation. To address this gap, we introduce CYBERMASKQA, a privacy-aware QA benchmark covering key security domains. Unlike existing benchmarks that primarily test factual knowledge, CYBERMASKQA grounds questions in realistic organizational contexts with explicit causal dependencies among assets and privileges. Generated through a systematic pipeline, the dataset combines human-curated base scenarios with LLM-driven semantic expansion, annotating each instance with precise private entity labels to enable controlled information disclosure. Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility for developing deployable, context-aware cybersecurity models and facilitating nuanced studies of privacy-utility trade-offs. Upon acceptance, we will release the dataset and the generation framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CYBERMASKQA, a privacy-aware QA benchmark for LLMs in cybersecurity. It claims the dataset grounds questions in realistic organizational contexts with explicit causal dependencies among assets and privileges, is generated via a pipeline of human-curated base scenarios plus LLM-driven semantic expansion, annotates each instance with precise private entity labels to support controlled disclosure, and includes evaluations of QA accuracy and masking performance to demonstrate utility for privacy-utility trade-off studies.
Significance. If the pipeline produces instances whose causal structures and private-entity annotations are verifiably accurate and realistic, the benchmark would address a genuine gap by enabling joint assessment of operational reasoning and privacy preservation in cybersecurity QA, supporting development of context-aware models suitable for regulated environments.
major comments (2)
- [Abstract] Abstract: the central claim that the benchmark 'grounds questions in realistic organizational contexts with explicit causal dependencies' and supplies 'precise private entity labels' rests on the unvalidated assertion that the human-curated + LLM-expansion pipeline preserves operational causality and correctly identifies all sensitive entities; no inter-annotator agreement, human review protocol, or quantitative checks on label accuracy or causality preservation are reported.
- [Abstract] Abstract (evaluations paragraph): the statement that 'Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility' is unsupported because the manuscript supplies no quantitative results, baseline comparisons, or dataset statistics, preventing assessment of whether the claimed joint evaluation of reasoning and privacy is actually feasible.
minor comments (1)
- [Abstract] The abstract states the dataset and framework will be released upon acceptance but provides no information on licensing, access method, or format.
Simulated Author's Rebuttal
Thank you for the constructive review and the recommendation for major revision. We address the two major comments point by point below. We agree that the abstract claims require additional supporting evidence from the manuscript and will revise accordingly to include validation details and quantitative results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the benchmark 'grounds questions in realistic organizational contexts with explicit causal dependencies' and supplies 'precise private entity labels' rests on the unvalidated assertion that the human-curated + LLM-expansion pipeline preserves operational causality and correctly identifies all sensitive entities; no inter-annotator agreement, human review protocol, or quantitative checks on label accuracy or causality preservation are reported.
Authors: We agree that the abstract's central claims would be strengthened by explicit validation of the pipeline. The full manuscript (Section 3) details the human curation of base scenarios by domain experts to establish causal dependencies among assets and privileges, followed by LLM-driven expansion with rule-based post-processing to maintain those dependencies and entity annotations. However, no inter-annotator agreement, formal human review protocol, or quantitative accuracy metrics on causality preservation or label correctness are currently reported. We will add a dedicated validation subsection reporting these elements, including IAA scores on a sampled subset and accuracy checks against ground-truth causal graphs. revision: yes
-
Referee: [Abstract] Abstract (evaluations paragraph): the statement that 'Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility' is unsupported because the manuscript supplies no quantitative results, baseline comparisons, or dataset statistics, preventing assessment of whether the claimed joint evaluation of reasoning and privacy is actually feasible.
Authors: The referee correctly observes that the manuscript does not currently include quantitative results, baselines, or dataset statistics to support the abstract's claim. The evaluations paragraph in the abstract is therefore unsupported as written. We will revise the manuscript to include a results section with QA accuracy metrics, masking performance (e.g., precision/recall on private entity detection), baseline comparisons against standard QA models, and basic dataset statistics (size, domain coverage, average causal chain length). The abstract will be updated to reference these specific findings. revision: yes
Circularity Check
No circularity: dataset construction with no derivation chain
full rationale
The paper introduces CYBERMASKQA as a benchmark dataset constructed via a human-curated base plus LLM semantic expansion pipeline, with private entity annotations. No equations, parameter fitting, uniqueness theorems, or ansatzes are present. The central claim is the existence and utility of the dataset for joint reasoning/privacy evaluation; this does not reduce to any self-referential definition or fitted input renamed as prediction. The contribution is self-contained as an artifact release rather than a closed-form result derived from its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM-driven semantic expansion from human-curated bases yields realistic organizational contexts and accurate private-entity annotations
invented entities (1)
-
CyberMaskQA benchmark with private entity annotations
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Cyberq: Generat- ing questions and answers for cybersecurity education using knowledge graph-augmented llms,
G. Agrawal, K. Pal, Y . Deng, H. Liu, and Y .-C. Chen, “Cyberq: Generat- ing questions and answers for cybersecurity education using knowledge graph-augmented llms,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 21, 2024, pp. 23 164–23 172
2024
-
[2]
CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic
J. Chen, A. Deevi, O. Gungor, and T. Rosing, “Can-qa: A question- answering benchmark for reasoning over in-vehicle can traffic,”arXiv preprint arXiv:2604.24935, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
O. Gungor, R. Sood, H. Wang, and T. Rosing, “Aqua-llm: Evaluating accuracy, quantization, and adversarial robustness trade-offs in llms for cybersecurity question answering,”arXiv preprint arXiv:2509.13514, 2025
-
[4]
O. Gungor, R. Sood, J. Zhou, and T. Rosing, “Eager: Edge-aligned llm defense for robust, efficient, and accurate cybersecurity question answering,”arXiv preprint arXiv:2511.19523, 2025
-
[5]
An empirical study of sensitive information in logs,
R. Aghili, H. Li, and F. Khomh, “An empirical study of sensitive information in logs,”arXiv preprint arXiv:2409.11313, 2024
-
[6]
Security operations center: A systematic study and open challenges,
M. Vielberth, F. B ¨ohm, I. Fichtinger, and G. Pernul, “Security operations center: A systematic study and open challenges,”Ieee Access, vol. 8, pp. 227 756–227 779, 2020
2020
-
[7]
Examining the factors that impact the severity of cyberattacks on critical infrastructures,
Y . Roumani and M. Alraee, “Examining the factors that impact the severity of cyberattacks on critical infrastructures,”Computers & Secu- rity, vol. 148, p. 104074, 2025
2025
-
[8]
Z. Liu, “Secqa: A concise question-answering dataset for evalu- ating large language models in computer security,”arXiv preprint arXiv:2312.15838, 2023
-
[9]
Cybermetric: a benchmark dataset based on retrieval- augmented generation for evaluating llms in cybersecurity knowledge,
N. Tihanyiet al., “Cybermetric: a benchmark dataset based on retrieval- augmented generation for evaluating llms in cybersecurity knowledge,” in2024 IEEE International Conference on Cyber Security and Resilience (CSR). IEEE, 2024, pp. 296–302
2024
-
[10]
Ctibench: A benchmark for evaluating llms in cyber threat intelligence,
M. T. Alamet al., “Ctibench: A benchmark for evaluating llms in cyber threat intelligence,”Advances in Neural Information Processing Systems, vol. 37, pp. 50 805–50 825, 2024
2024
-
[11]
Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity,
Z. Liu, J. Shi, and J. F. Buford, “Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity,” inAAAI 2024 Workshop on Artificial Intelligence for Cyber Security, 2024
2024
-
[12]
Priv-qa: Privacy-preserving question answering for cloud large language models,
G. Li, Y . Zhang, Y . Wang, S. Yan, L. Wang, and T. Wei, “Priv-qa: Privacy-preserving question answering for cloud large language models,” arXiv preprint arXiv:2502.13564, 2025
-
[13]
Con- qa: Privacy-preserving qa using cloud llms in contract domain,
A. K. Singh, R. Surya, A. Tripathi, S. Choudhury, and S. Bisane, “Con- qa: Privacy-preserving qa using cloud llms in contract domain,”arXiv preprint arXiv:2509.19925, 2025
-
[14]
V . B. Krishna, “Attackqa: Development and adoption of a dataset for assisting cybersecurity operations using fine-tuned and open-source llms,” 2024. [Online]. Available: https://arxiv.org/abs/2411.01073
-
[15]
Cyberllama: A fine- tuned large language model for cybersecurity named entity recognition,
H. Zhang, T. Wu, T. Zhu, S. Wen, and Y . Xiang, “Cyberllama: A fine- tuned large language model for cybersecurity named entity recognition,” Knowledge-Based Systems, vol. 328, p. 114183, 2025
2025
-
[16]
pii-masking-300k (revision 86db63b),
Ai4Privacy, “pii-masking-300k (revision 86db63b),” 2024. [Online]. Available: https://huggingface.co/datasets/ai4privacy/pii-masking-300k
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.