Abstract
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
| Original language | English |
|---|---|
| Pages (from-to) | 1499-1510 |
| Number of pages | 12 |
| Journal | Frontiers of Information Technology and Electronic Engineering |
| Volume | 18 |
| Issue number | 10 |
| DOIs | |
| State | Published - 1 Oct 2017 |
| Externally published | Yes |
Keywords
- Combined frequency vectors
- Data partition
- Edit distance
- Filter and refine
- String similarity join
Fingerprint
Dive into the research topics of 'FrepJoin: an efficient partition-based algorithm for edit similarity join'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver