pith. machine review for the scientific record.

arxiv: 2604.09277 · v1 · submitted 2026-04-10 · 💻 cs.DB

Recognition: unknown

A Catalog of Data Errors

Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3

classification 💻 cs.DB
keywords data errors · error taxonomy · tabular data · data quality · data cleaning · missing values · outliers · bias

The pith

A catalog defines 35 distinct error types for tabular data in three non-overlapping categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors compile a catalog of data errors that harm databases and downstream applications such as machine learning. It covers both obvious issues, such as missing values and duplicates, and less obvious ones, such as disguised missing values and word transpositions, along with statistical problems like outliers and bias. The catalog places these in three non-overlapping groups and gives each type a formal definition plus an example. It also reconciles conflicting terminology from earlier, mostly informal taxonomies. This matters because better-organized knowledge of errors lets people build more effective tools to find and fix problems in data.

Core claim

The paper establishes a comprehensive catalog of 35 distinct error types for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For every type the catalog supplies a formal definition and a practical example, and it resolves terminological inconsistencies in earlier research.
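The classification can be made concrete as a total map from error type to exactly one category. A minimal sketch in Python, using a five-entry subset of the types named in the abstract (the full catalog has 35; the category assignments below are my reading of the abstract, not taken from the paper's tables):

```python
from enum import Enum

class Category(Enum):
    MISSING = "missing"
    INCORRECT = "incorrect"
    REDUNDANT = "redundant"

# Illustrative subset; the paper's catalog defines 35 entries.
CATALOG = {
    "missing value":           Category.MISSING,
    "disguised missing value": Category.MISSING,
    "constraint violation":    Category.INCORRECT,
    "word transposition":      Category.INCORRECT,
    "duplicate tuple":         Category.REDUNDANT,
}

def category_of(error_type: str) -> Category:
    # A dict is a total function on its keys: every catalogued type
    # lands in exactly one category, which is the non-overlap claim.
    return CATALOG[error_type]
```

A falsifying case, in the paper's sense, would be a real tabular-data error that cannot be given exactly one entry in such a map.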

What carries the argument

The three-category classification of error types into missing, incorrect, and redundant, which organizes the 35 types and supports targeted detection and cleaning approaches.

If this is right

  • Data quality tools can implement error-specific detection and cleaning strategies for each listed type.
  • Researchers gain a standardized way to address both traditional data errors and statistical indicators such as bias.
  • Terminological inconsistencies across related work are resolved, improving clarity in the field.
  • The catalog supports systematic handling of errors that arise in database design and operation phases.
  • Practitioners can better prepare data for downstream tasks like analytics reports and machine learning pipelines.
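The first bullet's "error-specific detection" could be wired up as a registry keyed by catalog entry. Everything below (function names, the toy rows) is a hypothetical sketch of that idea, not the paper's implementation:

```python
def detect_missing_values(rows, column):
    """Flag row indices whose cell in `column` is absent."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def detect_duplicate_tuples(rows):
    """Flag row indices that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

# One detector per catalogued error type (illustrative subset).
DETECTORS = {
    "missing value": detect_missing_values,
    "duplicate tuple": detect_duplicate_tuples,
}

data = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": None},
    {"id": 1, "name": "Ada"},
]
assert DETECTORS["missing value"](data, "name") == [1]
assert DETECTORS["duplicate tuple"](data) == [2]
```

The design point is that a non-overlapping taxonomy lets each detector own one error type, so a cleaning pipeline can report which catalog entry fired rather than a generic "dirty cell".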

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The catalog might inspire similar classifications for non-tabular data formats such as graphs or text collections.
  • It could lead to benchmarks that measure how well tools handle the full range of these 35 error types.
  • Future studies might test whether using this taxonomy improves the performance of automated data cleaning systems.
  • Connections to error propagation through entire analysis pipelines could be explored using the categories.

Load-bearing premise

That the 35 error types are truly distinct, that the three categories are non-overlapping, and that the categories together cover every possible error in tabular data.

What would settle it

Discovery of a concrete error in tabular data that cannot be assigned to exactly one of the three categories without overlap or omission would falsify the catalog's claims.

Figures

Figures reproduced from arXiv: 2604.09277 by Divesh Srivastava, Divya Bhadauria, Felix Naumann, Hazar Harmouch, Lisa Ehrlinger.

Figure 1. Entity-relationship diagram for the running example.
Figure 2. Example instances of the Employee, Certificate, EmployeeCertificate, and Department relations. Grayed-out tuples show the corresponding correct real-world values of each attribute, indicating that the database values in those tuples are incorrect. For instance, if a tuple 𝑡 is incorrect, we represent the real-world values in a grayed-out tuple 𝑡′. … component is irrelevant, we omit trailing s…
Figure 3. A hierarchy of data error types (blue background) and error indicators (gray background).
Original abstract

Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies, mostly informal and not covering the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types, including both data errors (e.g., missing values, duplicate tuples) and error indicators (e.g., outliers, bias) for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For each error type, we provide a formal definition and practical example, and resolve terminological inconsistencies across related work. Our catalog enables researchers and practitioners to address various error types and systematically implement error-specific detection and cleaning strategies in data quality tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a synthesized catalog of 35 distinct error types for tabular data, partitioned into three non-overlapping categories (missing, incorrect, and redundant). Each type is supplied with a formal definition and a practical example; the work also claims to resolve terminological inconsistencies across prior literature and to incorporate statistical error indicators (e.g., outliers, bias) relevant to AI pipelines.

Significance. If the claimed non-overlapping and comprehensive classification holds, the catalog could become a useful reference for standardizing terminology and guiding the implementation of error-specific detection and cleaning routines in data-quality tools and ML pipelines. The shift from informal taxonomies to formally defined entries is a positive step.

major comments (2)
  1. [Classification and Definitions] The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.
  2. [Error Indicators subsection] The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.
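The boundary case raised in major comment 1 can be made concrete: a disguised missing value is syntactically present but semantically absent, which is why a definition keyed to what the value signals can keep it out of the incorrect category. A minimal sketch, assuming a hand-picked placeholder list (the tokens below are illustrative, not from the paper):

```python
# Hypothetical placeholder set; real systems would configure or
# learn these per column rather than hard-code them.
DISGUISE_TOKENS = {"n/a", "na", "none", "-", "?", "-999", "9999-99-99"}

def is_disguised_missing(value) -> bool:
    """True for values that are present but encode absence.

    Such cells would be classified as *missing* (the value signals
    absence), not *incorrect* (the value asserts a wrong fact).
    """
    return isinstance(value, str) and value.strip().lower() in DISGUISE_TOKENS
```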
minor comments (2)
  1. A single summary table listing all 35 types, their category, a one-sentence definition, and key literature references would greatly improve usability and allow readers to verify the claimed resolution of terminological inconsistencies.
  2. The abstract mentions 'word transpositions' as an underexplored type; ensure it appears in the catalog with its formal definition and that its placement under one of the three categories is unambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the catalog. We address each major point below and will incorporate revisions to strengthen the presentation of the classification and the role of error indicators.

Point-by-point responses
  1. Referee: The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.

    Authors: The definitions are constructed to enforce mutual exclusivity by focusing on the primary characteristic of each error: missing errors concern the absence or non-representation of data (with disguised missing values defined as placeholders that signal absence rather than incorrect content), incorrect errors involve present values that deviate from ground truth, and redundant errors involve unnecessary repetition. While this structure supports the non-overlapping claim, we agree that an explicit boundary-case analysis would improve clarity. We will add a dedicated subsection that examines potential overlaps, including disguised missing values, and outlines a resolution procedure based on the dominant error characteristic. revision: yes

  2. Referee: The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.

    Authors: The manuscript positions outliers and bias as error indicators that are themselves catalogued as distinct types because they directly affect data quality and downstream AI pipelines, even when they may arise from or point to other issues. This is distinct from treating them solely as symptoms. To make the distinction explicit and support practical use, we will expand the Error Indicators subsection with decision criteria that differentiate primary error indicators from potential symptoms, including implications for cleaning strategies. revision: yes
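The indicator-versus-error distinction the authors promise to formalize can be illustrated with a toy criterion: an indicator flags cells for review rather than asserting they are wrong. A z-score sketch under that assumption (the threshold is illustrative, not from the paper):

```python
from statistics import mean, stdev

def outlier_indicator(values, threshold=3.0):
    """Return indices whose z-score exceeds `threshold`.

    These are *indicators*: candidates for inspection that may point
    to an underlying error, not confirmed errors themselves.
    """
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]
```

A cleaning pipeline built on this distinction would route indicator hits to verification (is the extreme value real or erroneous?) instead of repairing them outright.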

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a literature synthesis that compiles and formally defines 35 error types for tabular data, partitioned into the three categories of missing, incorrect, and redundant. No equations, predictions, fitted parameters, or deductive derivations appear in the work. The asserted properties of distinctness and non-overlap follow directly from the explicit definitions supplied for each type, which is a standard definitional construction rather than a reduction of any claim to its own inputs. All source material is drawn from external prior taxonomies; the paper's contribution is organization and terminological clarification, with no self-citation chains or uniqueness theorems invoked as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The catalog is constructed from existing literature on data errors, without introducing new free parameters, axioms, or invented entities; it synthesizes and formalizes known concepts.

pith-pipeline@v0.9.0 · 5523 in / 1074 out tokens · 35297 ms · 2026-05-10T16:36:17.185273+00:00 · methodology

discussion (0)

