Abstract
String similarity processing is well adopted in various fields such as data cleaning, information integration, spelling check and bioinformatics. With the rapid growth of information, processing strings over massive datasets becomes a significant basic operation. Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages: large storage cost, poor efficiency of updates and limited types of operation. At meanwhile, most of these methods are main memory based. Bed-tree is a recently proposed tree structured, disk resident index which supports diverse operations. Bed-tree fails to cluster similar strings together, thus incurs large I/O costs while string similarity processing. This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions. RM-tree supports multiple types of string similarity operations such as range query, top-k query and string similarity join. This paper presents the construction method of RM-tree and its query processing algorithms. RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset. Experiment results show that RM-tree is efficient for supporting string similarity operations, and provides better performance than Bed-tree.
| Original language | English |
|---|---|
| Pages (from-to) | 2142-2154 |
| Number of pages | 13 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 34 |
| Issue number | 11 |
| DOIs | |
| State | Published - Nov 2011 |
Keywords
- Index
- Join processing
- Query processing
- Similarity
- String
Fingerprint
Dive into the research topics of 'RM-tree: An index supporting string similarity operations'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver