arxiv: 2604.18191 · v1 · submitted 2026-04-20 · 💻 cs.PL

Recognition: unknown

Implementing CPSLint: A Data Validation and Sanitisation Tool for Industrial Cyber-Physical Systems

Mariëlle Stoelinga, Ömer Sayilir, Uraz Odyurt, Vadim Zaytsev


Pith reviewed 2026-05-10 03:30 UTC · model grok-4.3

classification 💻 cs.PL

keywords: domain-specific language · data sanitization · cyber-physical systems · time-series data · data preparation · data validation · industrial CPS · DSL


The pith

CPSLint is a domain-specific language that expresses industrial CPS data sanitization and validation in just a few lines of code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CPSLint as a DSL that raises the abstraction level for preparing raw time-series data from industrial cyber-physical systems, replacing repetitive ad-hoc Python scripts. It targets the common need to clean and validate large datasets before they can enter data-centric workflows, claiming that many CPS cases share similar preparation steps. By making the process concise and accessible to both data scientists and domain experts, the DSL aims to improve readability, reusability, and maintainability. The authors position the tool as publicly available and general enough for any time-series sanitization task. If the approach works, professionals would spend far less time rewriting similar scripts for each new dataset.

Core claim

CPSLint is a DSL designed to support the data preparation process for industrial CPS, where one can express the sanitization of large time-series data collections in just a few lines of code; it leverages the fact that many raw datasets in this domain require similar actions to become suitable for analysis, and it is presented as a publicly available tool applicable to any such case.

What carries the argument

CPSLint, a domain-specific language whose constructs abstract common sanitization and validation operations on CPS time-series data.

Load-bearing premise

Many raw industrial CPS time-series collections share enough similar preparation needs that they can be handled by a single concise DSL rather than case-by-case scripts.

What would settle it

A representative CPS dataset whose required sanitization steps cannot be expressed in CPSLint using only a few lines or that demands substantial additional custom code outside the DSL.

Figures

Figures reproduced from arXiv: 2604.18191 by Mariëlle Stoelinga, Ömer Sayilir, Uraz Odyurt, Vadim Zaytsev.

**Figure 1.** An example of a typical CPSLint workflow. The compiler takes a CPSLint specification and a raw CSV file as input. Our implementation then uses these to generate a human-readable Python script, capable of sanitising the raw CSV according to the provided definition. This script can then be run using a Python interpreter to obtain a sanitised CSV.

**Figure 2.** Different data compartmentalisation granularities within an execution timeline, visualising repeated phase types during consecutive rounds of tasks/sub-tasks, plus the phases considered for processing consecutive image data.

Adjacent paper text (truncated): 2.3. Emulating data corruptions. CPSLint is capable of both sanitising and compartmentalising time-series data. To be able to systematically evaluate and demonstrate these functionalit…

**Figure 3.** CPSLint language features running in Visual Studio Code.

Adjacent paper text (truncated): export csv ... to ’output#.csv’ cut when ’UART’ is ’image loader’~; instructs CPSLint to create multiple output CSV files, each covering a unique instance where image loading happens. 3.3. Example programs. Here, we will go over a few representative input programs for CPSLint and explain them in detail. Example program 1: out_of_bounds.cps. This prog…

**Figure 4.** Different workflows involving parsing and cutting of industrial CPS monitoring data: traditional ML model training dataset formation flow handling a) “Normal” data and b) “Anomalous” data; c) CNN model training dataset formation flow. Note the presence of parsing and cutting in all such workflows as fundamental and repeating data processing stages.

**Figure 5.** A tombstone diagram of the CPSLint compiler pipeline. Activities flow rightwards, with inspection and inferring of the data structure happening on the left side, followed by Python code generation to deliver the executable code operating on the actual machine trace. Ellipses indicate where the normal data processing flow connects. Note the role of a domain expert refining the inferred CPSLint specification…

**Figure 6.** A tombstone diagram of the CPSLint interpreter: the presence of the interpreter (vertical block) of CPSLint written in Rascal allows us to see CPSLint specifications as transformations from raw CSV to its sanitised form.

Adjacent paper text (truncated): Alongside the main output, the interpreter also writes the log and creates intermediate files after each step it performs during execution. 4.5. Integration into pipe…

**Figure 7.** Integration options considering the two modes of operation offered by CPSLint, i.e., a) compiler and b) interpreter modes.

Adjacent paper text (truncated): Among the tools for data wrangling, sanitisation and compartmentalisation, we recognise the following notable examples, each providing various sets of features. GNU datamash [9] is a command-line tool that performs basic numeric, textual, and statistical operations on textual data. Whi…
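
The text accompanying Figure 3 describes cutting a trace into one output CSV per instance where a column holds a given value. As a rough illustration of that idea in plain Python (this is neither CPSLint syntax nor the code its compiler generates), the sketch below splits rows into contiguous runs where a column equals a target value. The function names `cut_when` and `write_segments`, the sample trace, and the way `#` is expanded into a 1-based index are all assumptions made for illustration.

```python
import csv
import io
from itertools import groupby


def cut_when(rows, column, value):
    """Return one list of rows per contiguous run where row[column] == value."""
    segments = []
    for matches, run in groupby(rows, key=lambda row: row[column] == value):
        if matches:
            segments.append(list(run))
    return segments


def write_segments(segments, fieldnames, pattern="output#.csv"):
    """Serialise each segment to CSV text, keyed by a numbered file name.

    Hypothetical convention: '#' in the pattern becomes a 1-based index.
    Text is returned instead of written to disk, so the sketch has no side effects.
    """
    outputs = {}
    for index, segment in enumerate(segments, start=1):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(segment)
        outputs[pattern.replace("#", str(index))] = buffer.getvalue()
    return outputs


# Made-up trace; the column 'UART' and value 'image loader' follow the excerpt.
RAW = """time,UART,power
0,idle,1.0
1,image loader,2.5
2,image loader,2.7
3,idle,1.1
4,image loader,2.6
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
segments = cut_when(rows, "UART", "image loader")
files = write_segments(segments, fieldnames=["time", "UART", "power"])
```

Grouping on contiguous runs, rather than filtering all matching rows into one file, is what makes each output correspond to a single instance of the phase.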

Original abstract

Raw datasets are often too large and unstructured to work with directly, and require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception, as raw data typically consists of large time-series data collections that log the system's status at regular time intervals. The processing of such raw data is often carried out using ad hoc, case-specific, one-off Python scripts, often neglecting aspects of readability, reusability, and maintainability. In practice, this can cause professionals such as data scientists to write similar data preparation scripts for each case, requiring them to do much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPS. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL one can express the data preparation process in just a few lines of code. CPSLint is a publicly available tool applicable for any case involving time-series data collections in need of sanitisation.
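
To make the abstract's starting point concrete, here is a minimal sketch of the kind of one-off Python validation script it describes. It is not taken from the paper: the column names, the bounds, the sample data, and the function name `find_out_of_bounds` are hypothetical, and only the standard library is used.

```python
import csv
import io

# Hypothetical bounds per column; real limits would come from a domain expert.
BOUNDS = {"temperature": (0.0, 120.0), "pressure": (0.5, 10.0)}


def find_out_of_bounds(rows, bounds):
    """Return (row_number, column, raw_value) for each missing, non-numeric,
    or out-of-range cell. Row numbers are 1-based and exclude the header."""
    violations = []
    for row_number, row in enumerate(rows, start=1):
        for column, (low, high) in bounds.items():
            raw = row.get(column, "")
            try:
                value = float(raw)
            except (TypeError, ValueError):
                violations.append((row_number, column, raw))
                continue
            if not (low <= value <= high):
                violations.append((row_number, column, raw))
    return violations


# Made-up raw data with one spike, one missing value, and one low reading.
RAW = """timestamp,temperature,pressure
0,21.5,1.0
1,250.0,1.1
2,,1.2
3,22.0,0.1
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
violations = find_out_of_bounds(rows, BOUNDS)

# Simplest possible sanitisation: drop every row that has a violation.
bad_rows = {row_number for row_number, _, _ in violations}
clean_rows = [row for n, row in enumerate(rows, start=1) if n not in bad_rows]
```

Even this small script hard-codes column names, limits, and the drop policy, which is the per-case repetition the abstract argues a DSL should absorb.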

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces CPSLint, a domain-specific language (DSL) for data validation and sanitization of raw time-series datasets in industrial cyber-physical systems (CPS). It motivates the work by noting that ad-hoc Python scripts for data preparation lead to repetitive work and neglect readability, reusability, and maintainability. CPSLint is claimed to raise the abstraction level so that both data scientists and domain experts can express the preparation process in just a few lines of code, leveraging common sanitization actions across CPS datasets. The tool is presented as publicly available and generally applicable to any time-series data sanitization needs.

Significance. If the DSL successfully captures common CPS data-preparation patterns and delivers the promised conciseness and accessibility, the work could meaningfully reduce repetitive scripting effort, improve collaboration between domain experts and data scientists, and enhance maintainability of industrial data pipelines. Such a contribution would be relevant to the intersection of programming languages and applied CPS engineering.

major comments (2)

1. The central claim that data preparation 'can be expressed in just a few lines of code' and that the DSL 'raises the level of abstraction' is load-bearing for the usability argument, yet the abstract (and by extension the manuscript summary) provides no syntax examples, supported operations, or concrete code fragments to illustrate this.
2. The generality assertion that CPSLint is 'applicable for any case involving time-series data collections in need of sanitisation' rests on the premise that 'many raw data collections in the industrial CPS domain require similar actions.' No coverage analysis, set of supported versus unsupported operations, or multiple independent case studies is referenced to substantiate or bound this claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation.

Point-by-point responses

Referee: The central claim that data preparation 'can be expressed in just a few lines of code' and that the DSL 'raises the level of abstraction' is load-bearing for the usability argument, yet the abstract (and by extension the manuscript summary) provides no syntax examples, supported operations, or concrete code fragments to illustrate this.

Authors: We agree that the abstract would benefit from a concrete illustration to support the usability claims. While the full manuscript provides the DSL syntax definition, grammar, and multiple usage examples in Sections 3 (Language Design) and 4 (Implementation and Usage), we will revise the abstract to include a short, self-contained code fragment demonstrating a typical sanitization workflow expressed in a few lines.

Revision: yes

Referee: The generality assertion that CPSLint is 'applicable for any case involving time-series data collections in need of sanitisation' rests on the premise that 'many raw data collections in the industrial CPS domain require similar actions.' No coverage analysis, set of supported versus unsupported operations, or multiple independent case studies is referenced to substantiate or bound this claim.

Authors: The DSL was designed around recurrent sanitization patterns (e.g., outlier detection, missing-value handling, timestamp alignment, and normalization) that we observed across multiple industrial CPS datasets during development. To make this explicit, we will add a new subsection (or table) that lists all supported operations, their parameters, and the domain rationale for inclusion, thereby providing the requested coverage analysis. The current evaluation uses one representative industrial case study; we do not have additional independent case studies available for inclusion.

Revision: partial
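
The simulated rebuttal names missing-value handling, timestamp alignment, and normalization among the recurrent patterns. The sketch below shows one simple, generic form of each in plain Python so the scope of such operations is easier to picture. It is an assumption-laden illustration, not CPSLint's implementation or its supported operation set: the function names, the forward-fill policy, the fixed expected step, and the sample data are all hypothetical.

```python
def forward_fill(values):
    """Replace None entries with the most recent non-None value.

    Leading None entries stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def find_timestamp_gaps(timestamps, expected_step):
    """Return (previous, current) pairs whose spacing differs from expected_step."""
    return [
        (a, b)
        for a, b in zip(timestamps, timestamps[1:])
        if b - a != expected_step
    ]


def min_max_normalise(values):
    """Scale numeric values linearly to [0, 1]; a constant series maps to zeros.

    Assumes the input contains no None entries.
    """
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Made-up samples logged at a nominal 10-unit interval, with two dropped readings.
timestamps = [0, 10, 20, 40, 50]
readings = [2.0, None, 4.0, None, 6.0]

filled = forward_fill(readings)
gaps = find_timestamp_gaps(timestamps, expected_step=10)
normalised = min_max_normalise(filled)
```

Each function is a few lines, but the choices embedded in them (fill policy, expected interval, scaling range) are exactly the parameters a coverage table of supported operations would need to spell out.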

standing simulated objections not resolved

Additional independent case studies to further substantiate generality remain outstanding, since providing them would require access to proprietary industrial datasets beyond what is currently available.

Circularity Check

0 steps flagged

No circularity: tool description paper with no derivations or self-referential claims

Full rationale

The paper is a descriptive implementation report on CPSLint, a DSL for time-series data sanitization in industrial CPS. It states that many datasets require similar actions and that the DSL allows expression in few lines of code, but provides no equations, predictions, fitted parameters, uniqueness theorems, or derivation chains. The generality claim rests on domain observation rather than any reduction to inputs by construction. No self-citations or ansatzes are invoked in a load-bearing way. This is a standard non-circular tool paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption of commonality in CPS data preparation tasks and the utility of a DSL abstraction; no free parameters, mathematical axioms, or invented entities are introduced.



Reference graph

Works this paper leans on

19 extracted references · 15 canonical work pages

[1] M. Mernik, J. Heering, A. M. Sloane, When and How to Develop Domain-Specific Languages, ACM Computing Surveys (2005). doi:10.1145/1118890.1118892

[2] U. Odyurt, O. Sayilir, M. Stoelinga, V. Zaytsev, CPSLint, doi:10.5281/zenodo.17406795

[3] U. Odyurt, Ömer Sayilir, M. Stoelinga, V. Zaytsev, CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems, 2025. doi:10.48550/arXiv.2510.18651

[4] U. Odyurt, J. Roeder, A. D. Pimentel, I. G. Alonso, C. de Laat, Power Passports for Fault Tolerance: Anomaly Detection in Industrial CPS Using Electrical EFB, in: 2021 4th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS), 2021. doi:10.1109/ICPS49255.2021.9468262

[5] P. Klint, T. van der Storm, J. Vinju, RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation, in: 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, 2009. doi:10.1109/SCAM.2009.28

[6] P. Klint, T. van der Storm, J. Vinju, EASY Meta-programming with Rascal, in: Generative and Transformational Techniques in Software Engineering III: International Summer School, GTTSE 2009, Braga, Portugal, July 6-11, 2009. Revised Papers, Springer, 2011. doi:10.1007/978-3-642-18023-1_6

[7] U. Odyurt, D. Sapra, A. D. Pimentel, The Choice of AI Matters: Alternative Machine Learning Approaches for CPS Anomalies, in: Advances and Trends in Artificial Intelligence. From Theory to Practice, 2021. doi:10.1007/978-3-030-79463-7_40

[8] S. Erdweg, T. van der Storm, M. Völter, L. Tratt, R. Bosman, W. R. Cook, A. Gerritsen, A. Hulshout, S. Kelly, A. Loh, G. D. P. Konat, P. J. Molina, M. Palatnik, R. Pohjonen, E. Schindler, K. Schindler, R. Solmi, V. A. Vergu, E. Visser, K. van der Vlist, G. Wachsmuth, J. van der Woning, Evaluating and comparing language workbenches: Existing results an... (2015). doi:10.1016/j.cl.2015.08.007

[9] GNU Project, GNU datamash, 2025. URL: https://www.gnu.org/software/datamash/

[10] A. Hoff, Lisp Query Notation — A DSL for Data Processing, 2024. doi:10.5281/zenodo.11001584

[11] J. Giner-Miguelez, A. Gómez, J. Cabot, A Domain-Specific Language for Describing Machine Learning Datasets, Journal of Computer Languages (2023). doi:10.1016/j.cola.2023.101209

[12] F. Heine, C. Kleiner, T. Oelsner, A DSL for Automated Data Quality Monitoring, in: Database and Expert Systems Applications, 2020. doi:10.1007/978-3-030-59003-1_6

[13] A. de la Vega, D. García-Saiz, M. Zorrilla, P. Sánchez, Lavoisier: A DSL for Increasing the Level of Abstraction of Data Selection and Formatting in Data Mining, Journal of Computer Languages (2020). doi:10.1016/j.cola.2020.100987

[14] B. Sal, D. García-Saiz, A. de la Vega, P. Sánchez, Domain-Specific Languages for the Automated Generation of Datasets for Industry 4.0 Applications, Journal of Industrial Information Integration (2024). doi:10.1016/j.jii.2024.100657

[15] S. Ackermann, V. Jovanovic, T. Rompf, M. Odersky, Jet: An Embedded DSL for High Performance Big Data Processing, 2012. URL: https://infoscience.epfl.ch/handle/20.500.14299/85985

[16] B. Vogel-Heuser, M. Zhang, M. Krüger, A. Vicaria, M. Gardill, Y. Jiang, A. Trächtler, H. Peters, M. Liewald, A. Schenek, P. Heinzelmann, M. Weyrich, DSL4DPiFS — A Graphical Notation to Model Data Pipeline Deployment in Forming Systems, at - Automatisierungstechnik (2025). doi:10.1515/auto-2024-0114

[17] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel, B. Wiswedel, KNIME — the Konstanz information miner: version 2.0 and beyond, SIGKDD Explorations Newsletter (2009). doi:10.1145/1656274.1656280

[18] U. Odyurt, R. Loendersloot, T. Tinga, Demonstrators for Industrial Cyber-Physical System Research: A Requirements Hierarchy Driven by Software-Intensive Design, 2026. doi:10.48550/arXiv.2510.18534