CyberMaskQA: A Privacy-Aware Benchmark for Evaluating Large Language Models in Cybersecurity Question Answering

Jin Noh; Matilda Gaddi; Onat Gungor; Tajana Rosing

arxiv: 2605.24765 · v1 · pith:F3DALRHGnew · submitted 2026-05-23 · 💻 cs.CR · cs.LG

Pith reviewed 2026-06-30 12:42 UTC · model grok-4.3

keywords cybersecurity · privacy preservation · question answering benchmark · large language models · information masking · contextual reasoning · sensitive entity annotation

The pith

CyberMaskQA introduces a benchmark that grounds cybersecurity QA in realistic organizational contexts while annotating private entities for controlled disclosure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CyberMaskQA as a new dataset for testing large language models on cybersecurity tasks such as incident response. It creates questions from organizational scenarios that include explicit causal links among assets, privileges, and events. Each question carries labels for sensitive items like IP addresses and host names so that information can be masked or revealed in measured ways. This setup lets researchers measure both how well models reason about security operations and how effectively they avoid leaking private details. The authors argue the benchmark fills a gap left by existing datasets that focus only on factual recall without operational context or privacy constraints.

Core claim

CyberMaskQA is generated by a pipeline that starts with human-curated base scenarios and expands them through LLM-driven semantic variation, then annotates every instance with precise private-entity labels. The resulting questions embed explicit causal dependencies among assets and privileges inside realistic organizational settings across multiple security domains. This structure supports joint measurement of operational reasoning accuracy and privacy preservation under controlled disclosure conditions.

What carries the argument

The annotation of each instance with precise private entity labels that enable controlled information disclosure during evaluation.

If this is right

Models can be trained and tested for simultaneous gains in cybersecurity reasoning and reduced leakage of sensitive identifiers.
Researchers gain a tool to quantify privacy-utility trade-offs under realistic causal dependencies rather than isolated facts.
The benchmark supports development of context-aware systems suitable for regulated environments where cloud models cannot process raw logs.
Evaluations can separate performance on masked versus unmasked inputs to isolate the effect of privacy controls.
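The masked-versus-unmasked comparison in the last point can be sketched as a simple accuracy gap. This is a minimal illustration, not the paper's evaluation code: the `Prediction` record, its field names, and the `utility_gap` helper are all hypothetical, and it assumes multiple-choice answers scored by exact match.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One question answered twice: with raw and with masked context (hypothetical schema)."""
    question_id: str
    gold: str      # correct option label, e.g. "B"
    unmasked: str  # model answer given the raw context
    masked: str    # model answer given the masked context


def accuracy(preds, field):
    """Fraction of predictions whose `field` answer equals the gold label."""
    if not preds:
        return 0.0
    correct = sum(1 for p in preds if getattr(p, field) == p.gold)
    return correct / len(preds)


def utility_gap(preds):
    """Accuracy lost when the model only sees masked inputs."""
    return accuracy(preds, "unmasked") - accuracy(preds, "masked")


preds = [
    Prediction("q1", "A", "A", "A"),
    Prediction("q2", "B", "B", "C"),
    Prediction("q3", "D", "D", "D"),
    Prediction("q4", "C", "A", "A"),
]
print(accuracy(preds, "unmasked"), accuracy(preds, "masked"), utility_gap(preds))
```

A gap near zero would suggest that masking costs little reasoning accuracy on that sample; a large gap would point to questions whose answers depend on the masked identifiers themselves.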

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations could use the same annotation scheme to preprocess their own logs before sending queries to external models.
The causal-dependency structure might transfer to other domains that require both reasoning and selective redaction, such as medical or financial QA.
Future extensions could test whether models trained on this benchmark generalize to new organizational scenarios not seen during dataset creation.
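The log-preprocessing idea in the first point could look roughly like the sketch below. It is an assumption-heavy illustration: detection here uses regular expressions, which is not the method the paper describes; only the label names (`URL`, `IP_ADDRESS`, `FILE_PATH`) come from the paper's Table III; the `[LABEL_n]` tag format and the `mask`/`unmask` helpers are hypothetical. Reusing one tag per distinct value is a design choice meant to keep references to the same asset linked after masking.

```python
import re

# Order matters: URLs are masked first so their path part is not re-matched as a file path.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s,;]+")),
    ("IP_ADDRESS", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d{1,5})?\b")),
    ("FILE_PATH", re.compile(r"(?<![\w/])/(?:[\w.-]+/)+[\w.-]+")),
]


def mask(text):
    """Replace sensitive values with numbered tags; return masked text and the value->tag map."""
    mapping = {}
    counters = {}

    def make_sub(label):
        def sub(match):
            value = match.group(0)
            if value not in mapping:
                counters[label] = counters.get(label, 0) + 1
                mapping[value] = f"[{label}_{counters[label]}]"
            return mapping[value]
        return sub

    for label, pattern in PATTERNS:
        text = pattern.sub(make_sub(label), text)
    return text, mapping


def unmask(text, mapping):
    """Restore original values locally, e.g. in an answer returned by an external model."""
    for value, tag in mapping.items():
        text = text.replace(tag, value)
    return text


log = ("Host 10.0.4.17:22 fetched https://intranet.example/policy.pdf "
       "and wrote /var/log/auth.log; 10.0.4.17:22 retried.")
masked, mapping = mask(log)
print(masked)
```

Only the masked text would leave the organization; the mapping stays local so that answers can be unmasked afterwards.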

Load-bearing premise

The pipeline of human-curated base scenarios plus LLM semantic expansion produces instances whose private-entity labels correctly capture real operational contexts and causal structures.

What would settle it

A direct comparison of model accuracy and privacy scores on the released dataset versus a version with randomized or removed causal links and entity annotations. Identical scores on both versions would indicate that the causal structure and labels carry no measurable signal.

Figures

Figures reproduced from arXiv: 2605.24765 by Jin Noh, Matilda Gaddi, Onat Gungor, Tajana Rosing.

**Figure 1.** Overview of the CYBERMASKQA generation framework. The generated data is subsequently validated by human annotators to ensure quality and consistency. 3) Masking: This stage applies a masking protocol to anonymize sensitive information. Identifiers such as usernames and IP addresses are replaced with standardized tags to preserve privacy. 4) Dataset Expansion: Finally, the dataset is expanded by incorporating …

**Figure 3.** Organizational Attributes Example. Version 1: Employee Record. Question: “We have received a request to delete personal data for user EMP-3301. Should we honor this request?” Context: The file EmployeeRecords.csv lists EMP-3301 as an active employee in the Engineering department. The Employee Data Policy states that employee data cannot be deleted while the employee is actively employed. Correct Answer: No. …

**Figure 2.** Operational Schema Example. … realistic operational decision-making. 1) Identity & Access Management: This domain covers the lifecycles of user permissions, access control lists based on role, and enforcing the Principle of Least Privilege. Including this domain ensures that the model is tested on the ability to determine if a specific user possesses the privilege or authorization to perform a requested actio…

**Figure 4.** Context Difference Example. …ments containing structured system logs through a CSV file, unstructured organizational policies, and a context-dependent query

**Figure 5.** Context Correction Example.

Table III. Masking labels and their corresponding sensitive data types.

| Label | Description |
| --- | --- |
| USER_ACCOUNT | Person names, employee IDs, usernames, user IDs |
| IP_ADDRESS | IP addresses, optionally including ports |
| URL | Web addresses |
| FILE_PATH | File or directory paths |
| HOSTNAME | Network hostnames |
| SOFTWARE | Application or system software names |
| OS | Operating systems |
| HARDWARE | Devices, servers, or comp… |

**Figure 6.** MCQA accuracy across the four CYBERMASKQA categories. … of inferences required to answer the questions. Identity & Access Management is the most challenging category because correct answers require integrating information across multiple documents: a user’s group membership in one file, the group’s permissions in another, and the resource’s requirements in a third. For instance, determining whether user tjon…

**Figure 7.** Masking performance across the four CYBERMASKQA categories. …ing patterns, such as IP_ADDRESS and FILE_PATH, achieve higher masking performance, whereas entities with more varied formats and common nouns, such as IDENTIFIER and GROUP, exhibit lower performance. IDENTIFIER entities (incident IDs, ticket numbers, equipment IDs such as LAB-SEQ-07 or INC-20481) show this ambiguity challenge. Unlike IP_ADDRESS…

**Figure 8.** Entity-level masking performance averaged across the four LLMs.

**Figure 9.** Relationship between QA accuracy and masking recall.
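The Figure 6 excerpt describes Identity & Access Management questions as a three-document lookup: user to group, group to permissions, resource to required permission. The sketch below illustrates that chain under stated assumptions; the dictionaries, user names, permission strings, and the `is_authorized` helper are hypothetical stand-ins for the paper's CSV and policy documents, and the deny-by-default rule is a choice made here, not one taken from the source.

```python
def is_authorized(user, resource, memberships, group_permissions, resource_requirements):
    """Chain three lookups: user -> groups -> permissions, then compare with the resource's need."""
    required = resource_requirements.get(resource)
    if required is None:
        return False  # unknown resource: deny by default (an assumption of this sketch)
    granted = set()
    for group in memberships.get(user, ()):
        granted |= group_permissions.get(group, set())
    return required in granted


# Each dictionary stands in for a separate context document (hypothetical data).
memberships = {"user_a": ["eng"], "user_b": ["contractors"]}
group_permissions = {
    "eng": {"repo:read", "repo:write"},
    "contractors": {"repo:read"},
}
resource_requirements = {"deploy-pipeline": "repo:write"}

for user in ("user_a", "user_b"):
    print(user, is_authorized(user, "deploy-pipeline",
                              memberships, group_permissions, resource_requirements))
```

Answering correctly requires all three hops; dropping or masking any one link inconsistently breaks the chain, which is one plausible reason this category is reported as the hardest.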

Abstract

Large language models (LLMs) are increasingly applied to cybersecurity question answering (QA) for critical tasks such as incident response and vulnerability analysis. However, real-world operational contexts, including system logs and network configurations, inherently contain sensitive identifiers, e.g., IP addresses, host names, and user accounts. Processing this data with cloud-based models is often unsafe or infeasible in regulated environments. Furthermore, progress in privacy-preserving QA is hindered by the lack of annotated, context-rich datasets capable of jointly evaluating operational reasoning and privacy preservation. To address this gap, we introduce CYBERMASKQA, a privacy-aware QA benchmark covering key security domains. Unlike existing benchmarks that primarily test factual knowledge, CYBERMASKQA grounds questions in realistic organizational contexts with explicit causal dependencies among assets and privileges. Generated through a systematic pipeline, the dataset combines human-curated base scenarios with LLM-driven semantic expansion, annotating each instance with precise private entity labels to enable controlled information disclosure. Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility for developing deployable, context-aware cybersecurity models and facilitating nuanced studies of privacy-utility trade-offs. Upon acceptance, we will release the dataset and the generation framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CyberMaskQA targets a real gap with private-entity annotations in cybersecurity QA, but the abstract shows no validation or results, so the pipeline's quality is unproven.

This paper introduces CyberMaskQA as a benchmark that adds private entity annotations to cybersecurity QA scenarios to allow testing both reasoning and privacy preservation. It does well by focusing on a practical problem in regulated environments where sending logs to LLMs is risky, and by trying to include causal dependencies in the questions. The pipeline of human-curated bases plus LLM semantic expansion is described, and they plan to release the dataset.

However, the abstract supplies no quantitative results, no validation metrics for the annotations, and no comparison to existing benchmarks. The stress-test concern about unvalidated LLM expansion holds here because there's no inter-annotator agreement or human review protocol mentioned. Without that, it's difficult to trust that the instances are realistic or that the labels are accurate. The central claim that it enables nuanced studies of privacy-utility trade-offs can't be evaluated from what's given.

This is for researchers in AI for cybersecurity who care about privacy constraints. A serious referee could look at it if the full paper includes the promised evaluations and the dataset, as the problem it targets is important. I would recommend engaging with the work once the data is out, to see if the generation pipeline produces usable examples.
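The inter-annotator agreement that the note finds missing is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic illustration for two annotators, not a protocol taken from the paper; the per-token labelling scheme and the `"O"` label for non-sensitive tokens are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["IP_ADDRESS", "HOSTNAME", "O", "O", "URL", "O"]
annotator_2 = ["IP_ADDRESS", "O", "O", "O", "URL", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Raw agreement in this toy sample is 5 of 6 items, but kappa is lower because most tokens are non-sensitive and would match by chance.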

Referee Report

2 major / 1 minor

Summary. The paper introduces CYBERMASKQA, a privacy-aware QA benchmark for LLMs in cybersecurity. It claims the dataset grounds questions in realistic organizational contexts with explicit causal dependencies among assets and privileges, is generated via a pipeline of human-curated base scenarios plus LLM-driven semantic expansion, annotates each instance with precise private entity labels to support controlled disclosure, and includes evaluations of QA accuracy and masking performance to demonstrate utility for privacy-utility trade-off studies.

Significance. If the pipeline produces instances whose causal structures and private-entity annotations are verifiably accurate and realistic, the benchmark would address a genuine gap by enabling joint assessment of operational reasoning and privacy preservation in cybersecurity QA, supporting development of context-aware models suitable for regulated environments.

major comments (2)

[Abstract] The central claim that the benchmark 'grounds questions in realistic organizational contexts with explicit causal dependencies' and supplies 'precise private entity labels' rests on the unvalidated assertion that the human-curated + LLM-expansion pipeline preserves operational causality and correctly identifies all sensitive entities; no inter-annotator agreement, human review protocol, or quantitative checks on label accuracy or causality preservation are reported.
[Abstract, evaluations paragraph] The statement that 'Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility' is unsupported because the manuscript supplies no quantitative results, baseline comparisons, or dataset statistics, preventing assessment of whether the claimed joint evaluation of reasoning and privacy is actually feasible.

minor comments (1)

[Abstract] The abstract states the dataset and framework will be released upon acceptance but provides no information on licensing, access method, or format.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and the recommendation for major revision. We address the two major comments point by point below. We agree that the abstract claims require additional supporting evidence from the manuscript and will revise accordingly to include validation details and quantitative results.

Referee: [Abstract] The central claim that the benchmark 'grounds questions in realistic organizational contexts with explicit causal dependencies' and supplies 'precise private entity labels' rests on the unvalidated assertion that the human-curated + LLM-expansion pipeline preserves operational causality and correctly identifies all sensitive entities; no inter-annotator agreement, human review protocol, or quantitative checks on label accuracy or causality preservation are reported.

Authors: We agree that the abstract's central claims would be strengthened by explicit validation of the pipeline. The full manuscript (Section 3) details the human curation of base scenarios by domain experts to establish causal dependencies among assets and privileges, followed by LLM-driven expansion with rule-based post-processing to maintain those dependencies and entity annotations. However, no inter-annotator agreement, formal human review protocol, or quantitative accuracy metrics on causality preservation or label correctness are currently reported. We will add a dedicated validation subsection reporting these elements, including IAA scores on a sampled subset and accuracy checks against ground-truth causal graphs. revision: yes
Referee: [Abstract, evaluations paragraph] The statement that 'Evaluations of QA accuracy and masking performance demonstrate the benchmark's utility' is unsupported because the manuscript supplies no quantitative results, baseline comparisons, or dataset statistics, preventing assessment of whether the claimed joint evaluation of reasoning and privacy is actually feasible.

Authors: The referee correctly observes that the manuscript does not currently include quantitative results, baselines, or dataset statistics to support the abstract's claim. The evaluations paragraph in the abstract is therefore unsupported as written. We will revise the manuscript to include a results section with QA accuracy metrics, masking performance (e.g., precision/recall on private entity detection), baseline comparisons against standard QA models, and basic dataset statistics (size, domain coverage, average causal chain length). The abstract will be updated to reference these specific findings. revision: yes
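The masking metrics promised in the second response (precision and recall on private entity detection) can be computed at the entity level as sketched below. This is an illustration under assumptions: entities are represented as `(start, end, label)` character spans, a prediction counts only on exact span-and-label match, and the `masking_scores` helper is hypothetical. The label names come from the paper's figure captions.

```python
def masking_scores(gold, predicted):
    """Entity-level precision, recall and F1 with exact (start, end, label) matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one context: character offsets plus entity label.
gold = {(5, 17, "IP_ADDRESS"), (30, 42, "HOSTNAME"), (50, 59, "IDENTIFIER")}
predicted = {(5, 17, "IP_ADDRESS"), (30, 42, "HOSTNAME"), (60, 65, "GROUP")}

precision, recall, f1 = masking_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For privacy, recall is usually the metric to watch: a missed entity is a leak, while a spurious mask only costs some utility.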

Circularity Check

0 steps flagged

No circularity: dataset construction with no derivation chain

The paper introduces CYBERMASKQA as a benchmark dataset constructed via a human-curated base plus LLM semantic expansion pipeline, with private entity annotations. No equations, parameter fitting, uniqueness theorems, or ansatzes are present. The central claim is the existence and utility of the dataset for joint reasoning/privacy evaluation; this does not reduce to any self-referential definition or fitted input renamed as prediction. The contribution is self-contained as an artifact release rather than a closed-form result derived from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified quality of the LLM-augmented generation process and the assumption that labeled scenarios capture real causal structures in cybersecurity operations.

axioms (1)

domain assumption: LLM-driven semantic expansion from human-curated bases yields realistic organizational contexts and accurate private-entity annotations.
Invoked in the description of the systematic pipeline that generates the dataset instances.

invented entities (1)

CyberMaskQA benchmark with private entity annotations (no independent evidence)
purpose: Provide context-rich, privacy-labeled cybersecurity QA instances for joint accuracy and masking evaluation.
Newly constructed artifact whose validity rests on the generation pipeline rather than external evidence.

Reference graph

Works this paper leans on

16 extracted references · 8 canonical work pages · 1 internal anchor

[1] G. Agrawal, K. Pal, Y. Deng, H. Liu, and Y.-C. Chen, “Cyberq: Generating questions and answers for cybersecurity education using knowledge graph-augmented llms,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 21, 2024, pp. 23164–23172.

[2] J. Chen, A. Deevi, O. Gungor, and T. Rosing, “Can-qa: A question-answering benchmark for reasoning over in-vehicle can traffic,” arXiv preprint arXiv:2604.24935, 2026.

[3] O. Gungor, R. Sood, H. Wang, and T. Rosing, “Aqua-llm: Evaluating accuracy, quantization, and adversarial robustness trade-offs in llms for cybersecurity question answering,” arXiv preprint arXiv:2509.13514, 2025.

[4] O. Gungor, R. Sood, J. Zhou, and T. Rosing, “Eager: Edge-aligned llm defense for robust, efficient, and accurate cybersecurity question answering,” arXiv preprint arXiv:2511.19523, 2025.

[5] R. Aghili, H. Li, and F. Khomh, “An empirical study of sensitive information in logs,” arXiv preprint arXiv:2409.11313, 2024.

[6] M. Vielberth, F. Böhm, I. Fichtinger, and G. Pernul, “Security operations center: A systematic study and open challenges,” IEEE Access, vol. 8, pp. 227756–227779, 2020.

[7] Y. Roumani and M. Alraee, “Examining the factors that impact the severity of cyberattacks on critical infrastructures,” Computers & Security, vol. 148, p. 104074, 2025.

[8] Z. Liu, “Secqa: A concise question-answering dataset for evaluating large language models in computer security,” arXiv preprint arXiv:2312.15838, 2023.

[9] N. Tihanyi et al., “Cybermetric: a benchmark dataset based on retrieval-augmented generation for evaluating llms in cybersecurity knowledge,” in 2024 IEEE International Conference on Cyber Security and Resilience (CSR). IEEE, 2024, pp. 296–302.

[10] M. T. Alam et al., “Ctibench: A benchmark for evaluating llms in cyber threat intelligence,” Advances in Neural Information Processing Systems, vol. 37, pp. 50805–50825, 2024.

[11] Z. Liu, J. Shi, and J. F. Buford, “Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity,” in AAAI 2024 Workshop on Artificial Intelligence for Cyber Security, 2024.

[12] G. Li, Y. Zhang, Y. Wang, S. Yan, L. Wang, and T. Wei, “Priv-qa: Privacy-preserving question answering for cloud large language models,” arXiv preprint arXiv:2502.13564, 2025.

[13] A. K. Singh, R. Surya, A. Tripathi, S. Choudhury, and S. Bisane, “Con-qa: Privacy-preserving qa using cloud llms in contract domain,” arXiv preprint arXiv:2509.19925, 2025.

[14] V. B. Krishna, “Attackqa: Development and adoption of a dataset for assisting cybersecurity operations using fine-tuned and open-source llms,” 2024. [Online]. Available: https://arxiv.org/abs/2411.01073

[15] H. Zhang, T. Wu, T. Zhu, S. Wen, and Y. Xiang, “Cyberllama: A fine-tuned large language model for cybersecurity named entity recognition,” Knowledge-Based Systems, vol. 328, p. 114183, 2025.

[16] Ai4Privacy, “pii-masking-300k (revision 86db63b),” 2024. [Online]. Available: https://huggingface.co/datasets/ai4privacy/pii-masking-300k
