pith. machine review for the scientific record.

arxiv: 2604.09277 · v1 · submitted 2026-04-10 · 💻 cs.DB

Recognition: unknown

A Catalog of Data Errors

Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3

classification 💻 cs.DB
keywords data errors · error taxonomy · tabular data · data quality · data cleaning · missing values · outliers · bias

The pith

A catalog defines 35 distinct error types for tabular data in three non-overlapping categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors compile a catalog of data errors that harm databases and downstream applications such as machine learning. It covers both obvious issues, such as missing values and duplicates, and less obvious ones, such as disguised missing values and word transpositions, along with statistical problems like outliers and bias. The catalog places these in three non-overlapping groups and gives each type a formal definition plus an example. It also reconciles conflicting terminology from earlier, mostly informal taxonomies. This matters because better-organized knowledge of errors lets people build more effective tools to find and fix problems in data.

Core claim

The paper establishes a comprehensive catalog of 35 distinct error types for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For every type the catalog supplies a formal definition and a practical example, and it resolves terminological inconsistencies in earlier research.
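The classification can be made concrete as a total map from error type to exactly one category. A minimal sketch in Python, using a five-entry subset of the types named in the abstract (the full catalog has 35; the category assignments below are my reading of the abstract, not taken from the paper's tables):

```python
from enum import Enum

class Category(Enum):
    MISSING = "missing"
    INCORRECT = "incorrect"
    REDUNDANT = "redundant"

# Illustrative subset; the paper's catalog defines 35 entries.
CATALOG = {
    "missing value":           Category.MISSING,
    "disguised missing value": Category.MISSING,
    "constraint violation":    Category.INCORRECT,
    "word transposition":      Category.INCORRECT,
    "duplicate tuple":         Category.REDUNDANT,
}

def category_of(error_type: str) -> Category:
    # A dict is a total function on its keys: every catalogued type
    # lands in exactly one category, which is the non-overlap claim.
    return CATALOG[error_type]
```

A falsifying case, in the paper's sense, would be a real tabular-data error that cannot be given exactly one entry in such a map.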

What carries the argument

The three-category classification of error types into missing, incorrect, and redundant, which organizes the 35 types and supports targeted detection and cleaning approaches.

If this is right

  • Data quality tools can implement error-specific detection and cleaning strategies for each listed type.
  • Researchers gain a standardized way to address both traditional data errors and statistical indicators such as bias.
  • Terminological inconsistencies across related work are resolved, improving clarity in the field.
  • The catalog supports systematic handling of errors that arise in database design and operation phases.
  • Practitioners can better prepare data for downstream tasks like analytics reports and machine learning pipelines.
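The first bullet's "error-specific detection" could be wired up as a registry keyed by catalog entry. Everything below (function names, the toy rows) is a hypothetical sketch of that idea, not the paper's implementation:

```python
def detect_missing_values(rows, column):
    """Flag row indices whose cell in `column` is absent."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def detect_duplicate_tuples(rows):
    """Flag row indices that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

# One detector per catalogued error type (illustrative subset).
DETECTORS = {
    "missing value": detect_missing_values,
    "duplicate tuple": detect_duplicate_tuples,
}

data = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": None},
    {"id": 1, "name": "Ada"},
]
assert DETECTORS["missing value"](data, "name") == [1]
assert DETECTORS["duplicate tuple"](data) == [2]
```

The design point is that a non-overlapping taxonomy lets each detector own one error type, so a cleaning pipeline can report which catalog entry fired rather than a generic "dirty cell".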

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The catalog might inspire similar classifications for non-tabular data formats such as graphs or text collections.
  • It could lead to benchmarks that measure how well tools handle the full range of these 35 error types.
  • Future studies might test whether using this taxonomy improves the performance of automated data cleaning systems.
  • Connections to error propagation through entire analysis pipelines could be explored using the categories.

Load-bearing premise

That the 35 error types are truly distinct, that the three categories are non-overlapping, and that the categories together cover every possible error in tabular data.

What would settle it

Discovery of a concrete error in tabular data that cannot be assigned to exactly one of the three categories without overlap or omission would falsify the catalog's claims.

Figures

Figures reproduced from arXiv: 2604.09277 by Divesh Srivastava, Divya Bhadauria, Felix Naumann, Hazar Harmouch, Lisa Ehrlinger.

Figure 1. Entity-relationship diagram for the running example.
Figure 2. Example instances of the Employee, Certificate, EmployeeCertificate, and Department relations. Grayed-out tuples show the corresponding correct real-world values of each attribute, indicating that the database values in those tuples are incorrect. For instance, if a tuple 𝑡 is incorrect, we represent the real-world values in a grayed-out tuple 𝑡′. … component is irrelevant, we omit trailing s…
Figure 3. A hierarchy of data error types (blue background) and error indicators (gray background).
Original abstract

Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies, mostly informal and not covering the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types, including both data errors (e.g., missing values, duplicate tuples) and error indicators (e.g., outliers, bias) for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For each error type, we provide a formal definition and practical example, and resolve terminological inconsistencies across related work. Our catalog enables researchers and practitioners to address various error types and systematically implement error-specific detection and cleaning strategies in data quality tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a synthesized catalog of 35 distinct error types for tabular data, partitioned into three non-overlapping categories (missing, incorrect, and redundant). Each type is supplied with a formal definition and a practical example; the work also claims to resolve terminological inconsistencies across prior literature and to incorporate statistical error indicators (e.g., outliers, bias) relevant to AI pipelines.

Significance. If the claimed non-overlapping and comprehensive classification holds, the catalog could become a useful reference for standardizing terminology and guiding the implementation of error-specific detection and cleaning routines in data-quality tools and ML pipelines. The shift from informal taxonomies to formally defined entries is a positive step.

major comments (2)
  1. [Classification and Definitions] The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.
  2. [Error Indicators subsection] The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.
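The boundary case raised in major comment 1 can be made concrete: a disguised missing value is syntactically present but semantically absent, which is why a definition keyed to what the value signals can keep it out of the incorrect category. A minimal sketch, assuming a hand-picked placeholder list (the tokens below are illustrative, not from the paper):

```python
# Hypothetical placeholder set; real systems would configure or
# learn these per column rather than hard-code them.
DISGUISE_TOKENS = {"n/a", "na", "none", "-", "?", "-999", "9999-99-99"}

def is_disguised_missing(value) -> bool:
    """True for values that are present but encode absence.

    Such cells would be classified as *missing* (the value signals
    absence), not *incorrect* (the value asserts a wrong fact).
    """
    return isinstance(value, str) and value.strip().lower() in DISGUISE_TOKENS
```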
minor comments (2)
  1. A single summary table listing all 35 types, their category, a one-sentence definition, and key literature references would greatly improve usability and allow readers to verify the claimed resolution of terminological inconsistencies.
  2. The abstract mentions 'word transpositions' as an underexplored type; ensure it appears in the catalog with its formal definition and that its placement under one of the three categories is unambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the catalog. We address each major point below and will incorporate revisions to strengthen the presentation of the classification and the role of error indicators.

Point-by-point responses
  1. Referee: The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.

    Authors: The definitions are constructed to enforce mutual exclusivity by focusing on the primary characteristic of each error: missing errors concern the absence or non-representation of data (with disguised missing values defined as placeholders that signal absence rather than incorrect content), incorrect errors involve present values that deviate from ground truth, and redundant errors involve unnecessary repetition. While this structure supports the non-overlapping claim, we agree that an explicit boundary-case analysis would improve clarity. We will add a dedicated subsection that examines potential overlaps, including disguised missing values, and outlines a resolution procedure based on the dominant error characteristic. revision: yes

  2. Referee: The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.

    Authors: The manuscript positions outliers and bias as error indicators that are themselves catalogued as distinct types because they directly affect data quality and downstream AI pipelines, even when they may arise from or point to other issues. This is distinct from treating them solely as symptoms. To make the distinction explicit and support practical use, we will expand the Error Indicators subsection with decision criteria that differentiate primary error indicators from potential symptoms, including implications for cleaning strategies. revision: yes
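The indicator-versus-error distinction the authors promise to formalize can be illustrated with a toy criterion: an indicator flags cells for review rather than asserting they are wrong. A z-score sketch under that assumption (the threshold is illustrative, not from the paper):

```python
from statistics import mean, stdev

def outlier_indicator(values, threshold=3.0):
    """Return indices whose z-score exceeds `threshold`.

    These are *indicators*: candidates for inspection that may point
    to an underlying error, not confirmed errors themselves.
    """
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]
```

A cleaning pipeline built on this distinction would route indicator hits to verification (is the extreme value real or erroneous?) instead of repairing them outright.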

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a literature synthesis that compiles and formally defines 35 error types for tabular data, partitioned into the three categories of missing, incorrect, and redundant. No equations, predictions, fitted parameters, or deductive derivations appear in the work. The asserted properties of distinctness and non-overlap follow directly from the explicit definitions supplied for each type, which is a standard definitional construction rather than a reduction of any claim to its own inputs. All source material is drawn from external prior taxonomies; the paper's contribution is organization and terminological clarification, with no self-citation chains or uniqueness theorems invoked as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The catalog is constructed from existing literature on data errors, without introducing new free parameters, axioms, or invented entities; it synthesizes and formalizes known concepts.

pith-pipeline@v0.9.0 · 5523 in / 1074 out tokens · 35297 ms · 2026-05-10T16:36:17.185273+00:00 · methodology

discussion (0)

