
Outlier detection forest for large-scale categorical data sets

  • Zhipeng Sun
  • Hongwei Du*
  • Qiang Ye
  • Chuang Liu
  • Patricia Lilian Kibenge
  • Hui Huang
  • Yuying Li
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • Dalhousie University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Outlier detection is one of the most important problems in data mining and has attracted much attention in recent years. A variety of outlier detection schemes have been proposed, but most of the existing methods work with numeric data sets. These methods cannot be applied directly to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes that are tailored to categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF achieves the same level of outlier detection precision with much better scalability.

Original language: English
Title of host publication: Computational Data and Social Networks - 8th International Conference, CSoNet 2019, Proceedings
Editors: Andrea Tagarelli, Hanghang Tong
Publisher: Springer
Pages: 45-56
Number of pages: 12
ISBN (Print): 9783030349790
State: Published - 2019
Externally published: Yes
Event: 8th International Conference on Computational Data and Social Networks, CSoNet 2019 - Ho Chi Minh City, Viet Nam
Duration: 18 Nov 2019 to 20 Nov 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11917 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Conference on Computational Data and Social Networks, CSoNet 2019
Country/Territory: Viet Nam
City: Ho Chi Minh City
Period: 18/11/19 to 20/11/19

Keywords

  • Big data
  • Categorical data
  • Entropy
  • Outlier detection
