
Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results can be applied to select an appropriate model in light of data quality and to determine the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that dirty-data impacts are related to the error type, the error rate, and the data size. Based on the findings, we suggest that users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.

Original language: English
Pages (from-to): 806-821
Number of pages: 16
Journal: Journal of Computer Science and Technology
Volume: 36
Issue number: 4
State: Published - Jul 2021
Externally published: Yes

Keywords

  • classification
  • clustering
  • data cleaning
  • data quality
  • model selection
