
SDLER: stacked dedupe learning for entity resolution in big data era

  • Alladoumbaye Ngueilbaye
  • Hongzhi Wang*
  • Daouda Ahmat Mahamat
  • Ibrahim A. Elgendy
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Université de N’Djamena (Tchad)
  • Menoufia University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

In the Big Data era, Entity Resolution (ER) faces many challenges, such as high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of Data Quality Evaluation. Moreover, despite more than seventy years of development efforts, there is still a high demand for democratizing ER to reduce human participation in tuning parameters, labeling data, defining blocking functions, and feature engineering. This study aimed to explore a novel Stacked Dedupe Learning (SDL) ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a Word Representation Distribution (WRD) that captures similarities among tuples. Where pre-trained word embeddings were not available, ways to learn and tune a WRD customized for ER tasks under different scenarios were also considered. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods, which rely on only a few attributes. The algorithm was tested on multiple datasets, including benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared with existing solutions.
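The abstract does not spell out how the LSH-based blocking is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes MinHash signatures computed over word tokens drawn from every attribute of a record, followed by the usual banding step; the names `tokens`, `minhash_signature`, `lsh_blocks`, `candidate_pairs` and the parameters `num_hashes` and `bands` are hypothetical and chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lower-cased word tokens taken from every attribute value of a record."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def _hash(seed, token):
    # Deterministic 64-bit hash of (seed, token); stable across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(token_set, num_hashes=32):
    """MinHash signature: the minimum hash per seed over all tokens."""
    if not token_set:
        return tuple([0] * num_hashes)  # assumption: empty records share one signature
    return tuple(min(_hash(seed, tok) for tok in token_set) for seed in range(num_hashes))


def lsh_blocks(records, num_hashes=32, bands=16):
    """Group record ids whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, record in records.items():
        sig = minhash_signature(tokens(record), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    # Only buckets holding more than one record yield comparisons.
    return [ids for ids in buckets.values() if len(ids) > 1]


def candidate_pairs(blocks):
    """Distinct record-id pairs that co-occur in at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


# Illustrative data only: records 1 and 2 differ in letter case, record 3 is unrelated.
records = {
    1: {"name": "Hongzhi Wang", "city": "Harbin"},
    2: {"name": "hongzhi wang", "city": "HARBIN"},
    3: {"name": "Jane Doe", "city": "Cairo"},
}
pairs = candidate_pairs(lsh_blocks(records))
```

Because the signature is built from the tokens of all attributes rather than one hand-picked blocking key, records that agree on most of their content tend to share a bucket, while unrelated records rarely do, which keeps the blocks small.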

Original language: English
Pages (from-to): 10959-10983
Number of pages: 25
Journal: Journal of Supercomputing
Volume: 77
Issue number: 10
State: Published - Oct 2021
Externally published: Yes

Keywords

  • Bidirectional RNN
  • Big data
  • Data quality
  • Entity resolution
  • Stacked Dedupe Learning (SDL)
  • Word Representation Distribution (WRD)
