TY - GEN
T1 - Outlier detection forest for large-scale categorical data sets
AU - Sun, Zhipeng
AU - Du, Hongwei
AU - Ye, Qiang
AU - Liu, Chuang
AU - Kibenge, Patricia Lilian
AU - Huang, Hui
AU - Li, Yuying
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - Outlier detection is one of the most important data mining problems, which has attracted much attention over the past years. So far, there have been a variety of different schemes for outlier detection. However, most of the existing methods work with numeric data sets. And these methods cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes that are tailored for categorical data tend to result in poor scalability, which makes them infeasible for large-scale data sets. In this paper, we propose a tree-based outlier detection algorithm for large-scale categorical data sets, Outlier Detection Forest (ODF). Our experimental results indicate that, compared with the state-of-the-art outlier detection schemes, ODF can achieve the same level of outlier detection precision and much better scalability.
AB - Outlier detection is one of the most important data mining problems, which has attracted much attention over the past years. So far, there have been a variety of different schemes for outlier detection. However, most of the existing methods work with numeric data sets. And these methods cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes that are tailored for categorical data tend to result in poor scalability, which makes them infeasible for large-scale data sets. In this paper, we propose a tree-based outlier detection algorithm for large-scale categorical data sets, Outlier Detection Forest (ODF). Our experimental results indicate that, compared with the state-of-the-art outlier detection schemes, ODF can achieve the same level of outlier detection precision and much better scalability.
KW - Big data
KW - Categorical data
KW - Entropy
KW - Outlier detection
UR - https://www.scopus.com/pages/publications/85077775923
U2 - 10.1007/978-3-030-34980-6_4
DO - 10.1007/978-3-030-34980-6_4
M3 - Conference contribution
AN - SCOPUS:85077775923
SN - 9783030349790
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 45
EP - 56
BT - Computational Data and Social Networks - 8th International Conference, CSoNet 2019, Proceedings
A2 - Tagarelli, Andrea
A2 - Tong, Hanghang
PB - Springer
T2 - 8th International Conference on Computational Data and Social Networks, CSoNet 2019
Y2 - 18 November 2019 through 20 November 2019
ER -