Skip to main navigation Skip to search Skip to main content

Processing string similarity search in external memory efficiently

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journalArticlepeer-review

Abstract

String similarity search is fundamental to various applications, such as data cleaning, spell checking, bioinformatics and information integration, since users tend to provide fuzzy queries in these applications due to input errors of both queried strings and data strings. With the rapid growth of data size, big string datasets become ubiquitous, and almost every modern information system stores data presented in string format. String similarity search should be well supported in modern information systems. Memory based q-gram inverted indexes fail to support string similarity search over big string datasets due to the memory constraint, and it can no longer work if the data size grows larger than memory size. Existing external memory method, Behm-Index, only supports length-filter and prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time. It is a disk resident index which suffers no limitation on data size compared to memory size. LPA-Index supports both length-filter and positional filter which reduce query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experiment results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over existing state of art disk index Behm-Index with regard to I/O cost and query response time.

Original languageEnglish
Pages (from-to)738-748
Number of pages11
JournalJisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume52
Issue number3
DOIs
StatePublished - 1 Mar 2015

Keywords

  • Edit distance
  • External memory
  • Query processing
  • Similarity search
  • String

Fingerprint

Dive into the research topics of 'Processing string similarity search in external memory efficiently'. Together they form a unique fingerprint.

Cite this