A Catalog of Data Errors
Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3
The pith
A catalog defines 35 distinct error types for tabular data in three non-overlapping categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a comprehensive catalog of 35 distinct error types for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For every type, the catalog supplies a formal definition and a practical example, and it resolves inconsistencies in terminology from earlier research.
What carries the argument
The three-category classification of error types into missing, incorrect, and redundant, which organizes the 35 types and supports targeted detection and cleaning approaches.
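The three-way partition can be sketched in code. This is a minimal illustrative model, not the paper's actual data structures; the example entries and paraphrased definitions are assumptions drawn only from the error-type names in the abstract.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorCategory(Enum):
    """The catalog's three non-overlapping top-level categories."""
    MISSING = "missing"      # data absent or not represented
    INCORRECT = "incorrect"  # value present but deviating from ground truth
    REDUNDANT = "redundant"  # unnecessary repetition of data

@dataclass(frozen=True)
class ErrorType:
    """One catalog entry: a named error type assigned to exactly one category."""
    name: str
    category: ErrorCategory
    definition: str

# Illustrative entries (names from the abstract; definitions paraphrased).
CATALOG = [
    ErrorType("missing value", ErrorCategory.MISSING,
              "a cell contains no value where one is expected"),
    ErrorType("constraint violation", ErrorCategory.INCORRECT,
              "a tuple breaks a declared integrity constraint"),
    ErrorType("duplicate tuple", ErrorCategory.REDUNDANT,
              "two tuples represent the same real-world entity"),
]
```

Modeling the category as a single enum field makes the non-overlap claim structural: each type belongs to exactly one category by construction, which is exactly the definitional move the paper relies on.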
If this is right
- Data quality tools can implement error-specific detection and cleaning strategies for each listed type.
- Researchers gain a standardized way to address both traditional data errors and statistical indicators such as bias.
- Terminological inconsistencies across related work are resolved, improving clarity in the field.
- The catalog supports systematic handling of errors that arise in database design and operation phases.
- Practitioners can better prepare data for downstream tasks like analytics reports and machine learning pipelines.
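Error-specific detection, as envisioned in the first bullet, could look like the following sketch: one detector per category over a toy table. The column names, the sentinel range constraint, and the specific checks are illustrative assumptions, not the paper's prescribed strategies.

```python
import pandas as pd

# Toy table with one seeded error per category: a missing value,
# an out-of-range (incorrect) value, and a duplicate tuple.
df = pd.DataFrame({
    "id":  [1, 2, 3, 3],
    "age": [34, None, -5, -5],
})

# Missing: cells with no value at all.
missing_mask = df["age"].isna()

# Incorrect: present values violating a simple range constraint (0 <= age <= 130).
incorrect_mask = df["age"].notna() & ~df["age"].between(0, 130)

# Redundant: fully duplicated tuples (the first occurrence is kept as valid).
duplicate_mask = df.duplicated(keep="first")

print(missing_mask.sum(), incorrect_mask.sum(), duplicate_mask.sum())  # 1 2 1
```

Because each mask targets one category, a cleaning tool can dispatch a different repair routine per mask (imputation, correction, deduplication) rather than one generic pass.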
Where Pith is reading between the lines
- The catalog might inspire similar classifications for non-tabular data formats such as graphs or text collections.
- It could lead to benchmarks that measure how well tools handle the full range of these 35 error types.
- Future studies might test whether using this taxonomy improves the performance of automated data cleaning systems.
- Connections to error propagation through entire analysis pipelines could be explored using the categories.
Load-bearing premise
That the 35 error types are truly distinct, that the three categories are non-overlapping, and that the categories together cover every possible error in tabular data.
What would settle it
Discovery of a concrete error in tabular data that cannot be assigned to exactly one of the three categories without overlap or omission would falsify the catalog's claims.
Original abstract
Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies, mostly informal and not covering the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types, including both data errors (e.g., missing values, duplicate tuples) and error indicators (e.g., outliers, bias) for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For each error type, we provide a formal definition and practical example, and resolve terminological inconsistencies across related work. Our catalog enables researchers and practitioners to address various error types and systematically implement error-specific detection and cleaning strategies in data quality tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a synthesized catalog of 35 distinct error types for tabular data, partitioned into three non-overlapping categories (missing, incorrect, and redundant). Each type is supplied with a formal definition and a practical example; the work also claims to resolve terminological inconsistencies across prior literature and to incorporate statistical error indicators (e.g., outliers, bias) relevant to AI pipelines.
Significance. If the claimed non-overlapping and comprehensive classification holds, the catalog could become a useful reference for standardizing terminology and guiding the implementation of error-specific detection and cleaning routines in data-quality tools and ML pipelines. The shift from informal taxonomies to formally defined entries is a positive step.
Major comments (2)
- [Classification and Definitions] The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.
- [Error Indicators subsection] The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.
Minor comments (2)
- A single summary table listing all 35 types, their category, a one-sentence definition, and key literature references would greatly improve usability and allow readers to verify the claimed resolution of terminological inconsistencies.
- The abstract mentions 'word transpositions' as an underexplored type; ensure it appears in the catalog with its formal definition and that its placement under one of the three categories is unambiguous.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of the catalog. We address each major point below and will incorporate revisions to strengthen the presentation of the classification and the role of error indicators.
Point-by-point responses
Referee: The central claim that the three categories are non-overlapping and that the 35 types are distinct is asserted by construction via the supplied definitions, yet no explicit boundary-case analysis or overlap-resolution procedure is provided. For example, 'disguised missing values' could plausibly be classified under both missing and incorrect depending on the chosen definition; this needs a dedicated subsection demonstrating mutual exclusivity.
Authors: The definitions are constructed to enforce mutual exclusivity by focusing on the primary characteristic of each error: missing errors concern the absence or non-representation of data (with disguised missing values defined as placeholders that signal absence rather than incorrect content), incorrect errors involve present values that deviate from ground truth, and redundant errors involve unnecessary repetition. While this structure supports the non-overlapping claim, we agree that an explicit boundary-case analysis would improve clarity. We will add a dedicated subsection that examines potential overlaps, including disguised missing values, and outlines a resolution procedure based on the dominant error characteristic.
Revision: yes
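The resolution rule the authors describe (absence dominates incorrectness for disguised missing values) can be made concrete. This is a hedged sketch of that procedure, not the paper's actual algorithm; the sentinel set and the function name are hypothetical.

```python
from typing import Optional

# Placeholders that commonly disguise a missing value (illustrative set;
# the catalog's actual sentinel definitions are not reproduced here).
SENTINELS = {"N/A", "n/a", "-", "?", "unknown", "99999"}

def classify_cell(value: Optional[str], expected: Optional[str]) -> str:
    """Assign a cell to exactly one category by its dominant characteristic:
    absence (including disguised absence) takes precedence over incorrectness."""
    if value is None:
        return "missing"
    if value in SENTINELS:
        # Disguised missing value: the placeholder signals absence,
        # so it resolves to 'missing', never to 'incorrect'.
        return "missing"
    if expected is not None and value != expected:
        return "incorrect"
    return "correct"
```

A precedence ordering like this is one simple way to guarantee each boundary case lands in exactly one category, which is what the promised subsection needs to demonstrate.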
Referee: The inclusion of error indicators such as 'outliers' and 'bias' alongside traditional data errors (e.g., duplicate tuples, constraint violations) is load-bearing for the comprehensiveness claim, but the manuscript does not clarify whether these are treated as errors themselves or as symptoms that may indicate other error types. This distinction affects downstream use in cleaning strategies and should be justified with a short decision tree or decision criteria.
Authors: The manuscript positions outliers and bias as error indicators that are themselves catalogued as distinct types because they directly affect data quality and downstream AI pipelines, even when they may arise from or point to other issues. This is distinct from treating them solely as symptoms. To make the distinction explicit and support practical use, we will expand the Error Indicators subsection with decision criteria that differentiate primary error indicators from potential symptoms, including implications for cleaning strategies.
Revision: yes
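One plausible form of the promised decision criteria: a hard constraint violation is a data error that can be repaired directly, while a statistical anomaly is only an indicator that warrants investigation. The thresholds and the z-score rule below are assumptions for illustration, not the authors' criteria.

```python
def triage(value: float, lower: float, upper: float,
           mean: float, std: float, z: float = 3.0) -> str:
    """Hard constraint violation -> data error (directly repairable).
    Statistical outlier -> error indicator (needs human or contextual review)."""
    if not (lower <= value <= upper):
        return "data error"       # provably violates a declared constraint
    if std > 0 and abs(value - mean) / std > z:
        return "error indicator"  # unusual, but not provably wrong
    return "ok"

print(triage(-4.0, 0.0, 130.0, 40.0, 12.0))   # data error
print(triage(120.0, 0.0, 130.0, 40.0, 12.0))  # error indicator
print(triage(41.0, 0.0, 130.0, 40.0, 12.0))   # ok
```

Separating the two outcomes matters downstream: a cleaning tool may auto-repair the first case but should only flag the second, which is exactly the distinction the referee asks the authors to formalize.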
Circularity Check
No significant circularity
Full rationale
The paper is a literature synthesis that compiles and formally defines 35 error types for tabular data, partitioned into the three categories of missing, incorrect, and redundant. No equations, predictions, fitted parameters, or deductive derivations appear in the work. The asserted properties of distinctness and non-overlap follow directly from the explicit definitions supplied for each type, which is a standard definitional construction rather than a reduction of any claim to its own inputs. All source material is drawn from external prior taxonomies; the paper's contribution is organization and terminological clarification, with no self-citation chains or uniqueness theorems invoked as load-bearing premises.