Skip to main navigation Skip to search Skip to main content

FrepJoin: an efficient partition-based algorithm for edit similarity join

  • School of Computer Science and Technology, Harbin Institute of Technology
  • Shenzhen University

Research output: Contribution to journalArticlepeer-review

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Original languageEnglish
Pages (from-to)1499-1510
Number of pages12
JournalFrontiers of Information Technology and Electronic Engineering
Volume18
Issue number10
DOIs
StatePublished - 1 Oct 2017
Externally publishedYes

Keywords

  • Combined frequency vectors
  • Data partition
  • Edit distance
  • Filter and refine
  • String similarity join

Fingerprint

Dive into the research topics of 'FrepJoin: an efficient partition-based algorithm for edit similarity join'. Together they form a unique fingerprint.

Cite this