arxiv: 2606.12225 · v1 · pith:O3COQ4FPnew · submitted 2026-06-10 · 💻 cs.CR

Bridging the Smart City Cybersecurity Data Gap Through AI-Driven Synthetic Dataset Generation

Pith reviewed 2026-06-27 09:16 UTC · model grok-4.3

classification 💻 cs.CR
keywords: cybersecurity · smart cities · synthetic data generation · generative AI · IoT security · dataset creation · attack simulation · network security

The pith

A generative AI framework produces synthetic cybersecurity datasets that replicate smart city device behaviors, network interactions, and attack scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Smart cities depend on interconnected sensors and IoT systems that create large attack surfaces, yet real datasets for testing defenses are often private, incomplete, or short on malicious activity. The paper proposes an AI-based synthetic data generation framework that uses generative models to build high-fidelity alternatives matching realistic conditions. These datasets are checked for protocol conformity, statistical resemblance to originals, and usefulness inside standard security tools. If the approach holds, researchers gain the ability to develop and validate threat models and defensive methods without the usual barriers of data access.

Core claim

The paper claims that an AI-based synthetic data generation framework built on generative AI models can produce high-fidelity synthetic cybersecurity datasets that replicate realistic device behaviors, network interactions, and cyber-attack scenarios for smart cities. The resulting data are evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools, with the aim of advancing threat modeling and defense evaluation.

What carries the argument

The AI-based synthetic data generation (SDG) framework, which uses generative AI models to create datasets that replicate device behaviors, network interactions, and attack scenarios.

If this is right

  • Researchers gain the ability to model smart city threats more effectively using accessible data.
  • Defensive techniques can be evaluated more comprehensively across varied attack scenarios.
  • Critical smart city infrastructures receive improved protection through better-tested cybersecurity methods.
  • Synthetic datasets conform to protocol standards and maintain statistical similarity to real data.
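
The protocol-conformity claim in the last bullet can be made concrete with a small field-level validator. The sketch below is an illustration only: the paper does not define a record schema, so `FlowRecord`, its fields, and the allowed-protocol set are assumptions, and a real check would likely parse traffic against the relevant protocol specifications rather than inspect a handful of fields.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical record layout; the paper does not specify one.
@dataclass
class FlowRecord:
    src_port: int
    dst_port: int
    protocol: str     # transport protocol name, e.g. "TCP"
    tcp_flags: int    # control-bit mask; only meaningful for TCP
    payload_len: int  # payload size in bytes

ALLOWED_PROTOCOLS = {"TCP", "UDP"}  # assumption made for this sketch
MAX_PORT = 65535                    # ports are 16-bit values

def conformity_errors(rec: FlowRecord) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    errors = []
    for name in ("src_port", "dst_port"):
        port = getattr(rec, name)
        if not 0 <= port <= MAX_PORT:
            errors.append(f"{name} out of range: {port}")
    if rec.protocol not in ALLOWED_PROTOCOLS:
        errors.append(f"unknown protocol: {rec.protocol}")
    if rec.payload_len < 0:
        errors.append(f"negative payload length: {rec.payload_len}")
    # The TCP control bits (CWR..FIN) fit in one byte.
    if rec.protocol == "TCP" and not 0 <= rec.tcp_flags <= 0xFF:
        errors.append(f"TCP flags do not fit in 8 bits: {rec.tcp_flags}")
    # UDP has no control bits, so any flag value is a generation artifact.
    if rec.protocol == "UDP" and rec.tcp_flags != 0:
        errors.append("UDP record carries TCP flags")
    return errors

def conformity_rate(records: list[FlowRecord]) -> float:
    """Fraction of records with no violations."""
    if not records:
        raise ValueError("no records to check")
    passed = sum(1 for rec in records if not conformity_errors(rec))
    return passed / len(records)
```

Reporting a rate rather than a pass/fail verdict keeps the check usable as a filter: malformed synthetic records can be dropped and the rejection share tracked as a quality signal for the generator.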

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthetic data proves effective in tools, it could reduce dependence on privacy-restricted real collections for training detection systems.
  • The same generation approach might apply to other data-scarce cyber-physical domains such as industrial control systems.
  • A practical next step would be to measure how well models trained solely on the synthetic sets detect novel attack variants in live deployments.

Load-bearing premise

Generative AI models can be trained or prompted to output data that passes statistical similarity checks and proves useful in common security tools.

What would settle it

A direct comparison in which a security tool trained on the synthetic data shows markedly lower performance on actual smart city network traces than on the synthetic traces themselves.
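
The comparison described above can be sketched as a train-on-synthetic, test-on-both experiment. The code below is a minimal illustration under stated assumptions: a nearest-centroid classifier stands in for "a security tool", the feature vectors are hand-made toy values rather than data from the paper, and the paper gives no threshold for what counts as "markedly lower".

```python
from statistics import mean

def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

def accuracy(centroids, X, y):
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)

def transfer_gap(train, synth_test, real_test):
    """Train on synthetic data, then score on synthetic and real hold-outs.

    Returns (accuracy on synthetic, accuracy on real, difference).
    """
    model = fit_centroids(*train)
    acc_synth = accuracy(model, *synth_test)
    acc_real = accuracy(model, *real_test)
    return acc_synth, acc_real, acc_synth - acc_real

# Toy, hand-made values for illustration only (not from the paper).
# Features: [packets per second, mean payload in units of 100 bytes]
SYNTH_TRAIN = ([[1, 1], [2, 1], [8, 9], [9, 8]],
               ["benign", "benign", "attack", "attack"])
SYNTH_TEST = ([[1, 2], [9, 9]], ["benign", "attack"])
REAL_TEST = ([[2, 2], [4, 3], [9, 9], [3, 2]],
             ["benign", "attack", "attack", "benign"])

if __name__ == "__main__":
    print(transfer_gap(SYNTH_TRAIN, SYNTH_TEST, REAL_TEST))
```

On these toy values the model scores 1.0 on the synthetic hold-out and 0.75 on the "real" one, a gap of 0.25, because one low-rate attack sits closer to the benign centroid. A large positive gap on actual smart city traces is the outcome that would count against the paper's premise.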

Figures

Figures reproduced from arXiv: 2606.12225 by John D. Hastings, Kyle Korman, Stephanie Polczynski, Varghese Vaidyan.

Figure 1: Overview of the AI-based Synthetic Dataset Generation Framework

Original abstract

Smart cities rely on interconnected cyber-physical systems that integrate sensors, IoT devices, cloud platforms, and AI-driven services and decision-making. While these systems enhance city services, they also introduce complex cybersecurity challenges due to their large attack surfaces, heterogeneous data flows, and evolving threat vectors. Developing and validating cybersecurity tools for smart cities requires high-quality datasets that accurately represent real operational conditions. However, real-world datasets are often incomplete, contain privacy-sensitive data, are difficult to access, or lack sufficient malicious activity to support tool development. This research addresses this critical gap by proposing an AI-based synthetic data generation (SDG) framework designed specifically for smart city cybersecurity research. The proposed framework leverages generative artificial intelligence models to produce high-fidelity synthetic cybersecurity datasets that replicate realistic device behaviors, network interactions, and cyber-attack scenarios. The synthetic datasets are evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools. The resulting synthetic data generation framework and evaluation metrics are expected to advance smart city cybersecurity by enabling researchers to model threats more effectively and evaluate defensive techniques more comprehensively to better protect critical smart city infrastructures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes an AI-based synthetic data generation (SDG) framework for smart city cybersecurity research. It claims that generative AI models can be used to produce high-fidelity synthetic datasets replicating realistic device behaviors, network interactions, and cyber-attack scenarios. These datasets are to be evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools, with the expectation that the framework will advance threat modeling and defensive technique evaluation in smart city infrastructures.

Significance. If a concrete implementation of the proposed framework were developed and shown to meet the stated evaluation criteria, it could meaningfully address data scarcity and privacy issues in smart city cybersecurity research, enabling broader experimentation with defensive tools. The core idea of using generative models for this purpose aligns with existing needs in the field, though the manuscript provides no evidence that the approach is feasible or novel relative to prior synthetic data work in cybersecurity.

major comments (3)
  1. [Abstract] The framework is presented only as a high-level proposal with no specification of the generative models (e.g., GAN, VAE, transformer, or diffusion variants), conditioning inputs, architecture, or training procedure. This is load-bearing because the central claim that the SDG pipeline produces high-fidelity data replicating device behaviors and attacks rests entirely on the unstated assumption that such models can be configured to succeed.
  2. [Abstract] No training corpus, loss functions, or quantitative results are supplied for any of the three evaluation axes (protocol conformity, statistical similarity, downstream utility in security tools). The text uses only forward-looking language ('are expected to advance') rather than demonstrated outcomes, leaving the load-bearing assumption that the pipeline meets these criteria untested.
  3. [Abstract] The manuscript contains no derivation, pseudocode, or preliminary validation showing that synthetic data can pass the required checks for smart-city traffic and attacks; without these elements the proposal cannot be assessed for internal consistency or feasibility.
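
To show what a quantitative result on the statistical-similarity axis could look like, here is a sketch of one candidate metric, the two-sample Kolmogorov-Smirnov statistic, computed per feature between a real and a synthetic sample. The choice of metric is an assumption; the manuscript does not name one.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 means the samples have identical empirical distributions,
    1.0 means their value ranges do not overlap at all.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so checking those points is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A per-feature statistic like this only compares marginal distributions. It says nothing about correlations between fields or about temporal structure, so it would be one input to a similarity report rather than a complete answer.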

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. We agree that the work is presented as a high-level conceptual proposal and will use the comments to strengthen the description of the framework.

  1. Referee: [Abstract] The framework is presented only as a high-level proposal with no specification of the generative models (e.g., GAN, VAE, transformer, or diffusion variants), conditioning inputs, architecture, or training procedure. This is load-bearing because the central claim that the SDG pipeline produces high-fidelity data replicating device behaviors and attacks rests entirely on the unstated assumption that such models can be configured to succeed.

    Authors: We agree that the manuscript describes the framework at a conceptual level without specifying particular generative models, conditioning inputs, or training procedures. The current version focuses on the overall pipeline and evaluation strategy rather than implementation details. We will revise to include example model selections (e.g., conditional GANs or diffusion models for traffic and attack generation), conditioning on device types and protocols, and high-level architecture and training considerations. revision: yes

  2. Referee: [Abstract] No training corpus, loss functions, or quantitative results are supplied for any of the three evaluation axes (protocol conformity, statistical similarity, downstream utility in security tools). The text uses only forward-looking language ('are expected to advance') rather than demonstrated outcomes, leaving the load-bearing assumption that the pipeline meets these criteria untested.

    Authors: The referee is correct that no specific training corpora, loss functions, or quantitative results are provided. As this is a framework proposal rather than an empirical study, the manuscript does not include implemented results. In a revision we will specify example training sources (public IoT and smart-city datasets), suggest loss functions aligned with the three evaluation axes, and clarify that empirical outcomes are intended as future work. revision: yes

  3. Referee: [Abstract] The manuscript contains no derivation, pseudocode, or preliminary validation showing that synthetic data can pass the required checks for smart-city traffic and attacks; without these elements the proposal cannot be assessed for internal consistency or feasibility.

    Authors: We acknowledge the absence of pseudocode, derivations, or preliminary validation. The manuscript is a high-level proposal, so these elements were not included. We will revise to add high-level pseudocode for the SDG pipeline and discuss feasibility based on related synthetic data literature for cybersecurity, while noting that concrete validation requires implementation. revision: yes
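
As a rough picture of the kind of high-level pipeline the responses above promise, the following skeleton conditions generation on device type, protocol, and attack label, then filters the output through validity checks. It is not the authors' pseudocode: every name here (`Condition`, `placeholder_sampler`, `run_pipeline`, the example labels) is hypothetical, and the random sampler merely stands in for a trained conditional generative model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Condition:
    device_type: str  # hypothetical label, e.g. "traffic_sensor"
    protocol: str     # hypothetical label, e.g. "MQTT"
    attack: str       # "none" for benign traffic

Record = Dict[str, object]
Sampler = Callable[[Condition, random.Random], Record]
Check = Callable[[Record], bool]

def placeholder_sampler(cond: Condition, rng: random.Random) -> Record:
    """Stand-in for a trained conditional model; emits random payload sizes."""
    return {
        "device_type": cond.device_type,
        "protocol": cond.protocol,
        "label": cond.attack,
        # Deliberately allows invalid (negative) values so the filter has work to do.
        "payload_len": rng.randint(-10, 200),
    }

def non_negative_payload(rec: Record) -> bool:
    """Example validity check; real checks would be protocol-specific."""
    return rec["payload_len"] >= 0

def run_pipeline(sampler: Sampler, conditions: List[Condition],
                 checks: List[Check], per_condition: int,
                 seed: int = 0) -> Tuple[List[Record], int]:
    """Generate per_condition records for each condition and keep those passing all checks.

    Returns the kept records and the number rejected.
    """
    rng = random.Random(seed)  # fixed seed keeps a run reproducible
    kept: List[Record] = []
    rejected = 0
    for cond in conditions:
        for _ in range(per_condition):
            rec = sampler(cond, rng)
            if all(check(rec) for check in checks):
                kept.append(rec)
            else:
                rejected += 1
    return kept, rejected
```

Passing the sampler and checks in as plain callables keeps the skeleton independent of any particular model family, which matches the referee's point that the model choice is still open.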

standing simulated objections not resolved
  • The manuscript contains no implementation or empirical results, so actual quantitative outcomes, trained models, or full validation data cannot be supplied in response to requests for demonstrated performance.

Circularity Check

0 steps flagged

No circularity: high-level proposal with no derivations or self-referential claims

The manuscript is a framework proposal that states an intent to use generative AI for synthetic datasets and lists evaluation criteria (protocol conformity, statistical similarity, utility) but supplies no equations, model architectures, training procedures, fitted parameters, or derivations. No self-citations appear in the provided text, and no step reduces a claimed result to its own inputs by construction. The central claim remains an unformalized expectation rather than a derived prediction, so no circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are named or quantified in the provided text.

pith-pipeline@v0.9.1-grok · 5736 in / 1096 out tokens · 19076 ms · 2026-06-27T09:16:43.555826+00:00 · methodology

