Skip to main navigation Skip to search Skip to main content

RM-tree: An index supporting string similarity operations

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journalArticlepeer-review

Abstract

String similarity processing is well adopted in various fields such as data cleaning, information integration, spelling check and bioinformatics. With the rapid growth of information, processing strings over massive datasets becomes a significant basic operation. Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages: large storage cost, poor efficiency of updates and limited types of operation. At meanwhile, most of these methods are main memory based. Bed-tree is a recently proposed tree structured, disk resident index which supports diverse operations. Bed-tree fails to cluster similar strings together, thus incurs large I/O costs while string similarity processing. This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions. RM-tree supports multiple types of string similarity operations such as range query, top-k query and string similarity join. This paper presents the construction method of RM-tree and its query processing algorithms. RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset. Experiment results show that RM-tree is efficient for supporting string similarity operations, and provides better performance than Bed-tree.

Original languageEnglish
Pages (from-to)2142-2154
Number of pages13
JournalJisuanji Xuebao/Chinese Journal of Computers
Volume34
Issue number11
DOIs
StatePublished - Nov 2011

Keywords

  • Index
  • Join processing
  • Query processing
  • Similarity
  • String

Fingerprint

Dive into the research topics of 'RM-tree: An index supporting string similarity operations'. Together they form a unique fingerprint.

Cite this