
Crowd-Guided Entity Matching with Consolidated Textual Data

  • Zhi Xu Li
  • Qiang Yang
  • An Liu*
  • Guan Feng Liu
  • Jia Zhu
  • Jia Jie Xu
  • Kai Zheng
  • Min Zhang
  • *Corresponding author for this work
  • Soochow University
  • Guangdong Key Laboratory of Big Data Analysis and Processing
  • South China Normal University
  • Beijing Key Laboratory of Big Data Management and Analysis Methods

Research output: Contribution to journal › Article › peer-review

Abstract

Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods that rely on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. Increasingly, databases include an unstructured textual attribute containing extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics in each piece of CText, and then measure the similarity between pieces of CText along the multiple sub-topic dimensions. To avoid overlooking important hidden sub-topics, we let the crowd help decide the weights of the different sub-topics in EM. Our empirical study on two real-world datasets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
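
The idea of comparing two pieces of CText along several sub-topic dimensions and combining the results with weights can be illustrated with a minimal sketch. This is not the paper's model: the co-occurrence-based topic model is not reproduced here, and the sketch simply assumes each record's text has already been split into named sub-topics. The function names, the sub-topic names ("plot", "cast"), the dictionary layout, and the use of cosine similarity per sub-topic are all assumptions made for illustration; the weights stand in for values that might be derived from crowd answers.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_subtopic_similarity(rec_a, rec_b, weights):
    """Weighted average of per-sub-topic cosine similarities.

    rec_a, rec_b: dict mapping a sub-topic name to a list of words
                  (hypothetical layout, assumed to be precomputed).
    weights:      dict mapping a sub-topic name to a non-negative weight
                  (assumed here to come from crowd answers).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, weight in weights.items():
        words_a = Counter(rec_a.get(topic, []))
        words_b = Counter(rec_b.get(topic, []))
        score += weight * cosine(words_a, words_b)
    return score / total


if __name__ == "__main__":
    # Two hypothetical records whose CText is already split into sub-topics.
    rec_a = {"plot": ["space", "crew", "alien"], "cast": ["weaver"]}
    rec_b = {"plot": ["alien", "crew", "ship"], "cast": ["weaver"]}
    # Hypothetical weights: "cast" is treated as more decisive than "plot".
    weights = {"plot": 1.0, "cast": 3.0}
    print(weighted_subtopic_similarity(rec_a, rec_b, weights))
```

With weights like these, a pair of records that agree on a heavily weighted sub-topic scores high even when a lightly weighted sub-topic overlaps only partially, which is the effect the crowd-decided weights are meant to capture.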

Original language: English
Pages (from-to): 858-876
Number of pages: 19
Journal: Journal of Computer Science and Technology
Volume: 32
Issue number: 5
DOIs
State: Published - 1 Sep 2017
Externally published: Yes

Keywords

  • consolidated textual data
  • crowdsourcing
  • entity matching
