
Information classification and extraction on official web pages of organizations

  • Jinlin Wang
  • Xing Wang*
  • Hongli Zhang
  • Binxing Fang
  • Yuchen Yang
  • Jianan Liu
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • China Electronic Equipment System Engineering Company

Research output: Contribution to journal › Article › peer-review

Abstract

As real-time, authoritative sources, the official Web pages of organizations contain a large amount of information. Because Web content and formats are so diverse, pre-processing is essential to obtain unified, attributed data that is valuable for organizational analysis and mining. Existing research handles multiple Web scenarios poorly and falls short in accuracy. This paper proposes a method to transform official organizational Web pages into attributed data. After locating the active blocks in a Web page, the method uses structural and content features to classify information with a specific model. Extraction methods based on a trigger lexicon and LSTM (Long Short-Term Memory) then process the classified information efficiently and extract the data that matches each attribute. Together, these steps form an accurate and efficient method for classifying and extracting information from official organizational Web pages. Experimental results show that the approach improves the performance indicators and exceeds the state of the art on a real data set of official organizational Web pages.
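
To make the idea of trigger-lexicon extraction concrete, here is a minimal sketch in Python. It is an illustration only and not the authors' implementation: the lexicon entries, the attribute names, the `extract_attributes` helper, and the assumed "trigger: value" line layout are all hypothetical, and the abstract does not specify any of them. The sketch assumes the text has already been located and classified, and simply maps trigger words to attributes.

```python
import re

# Hypothetical trigger lexicon (attribute -> trigger words); illustrative only,
# not taken from the paper.
TRIGGER_LEXICON = {
    "address": ["address", "location"],
    "phone": ["tel", "phone", "telephone"],
    "email": ["email", "e-mail"],
}


def extract_attributes(text_block, lexicon=TRIGGER_LEXICON):
    """Return {attribute: value} for lines shaped like '<trigger>: <value>'.

    Assumes one attribute per line; the first match for each attribute wins.
    """
    results = {}
    for line in text_block.splitlines():
        for attribute, triggers in lexicon.items():
            # Trigger word at the start of the line, then a colon, then the value.
            alternatives = "|".join(re.escape(t) for t in triggers)
            pattern = r"^\s*(?:%s)\s*:\s*(.+)$" % alternatives
            match = re.match(pattern, line, flags=re.IGNORECASE)
            if match and attribute not in results:
                results[attribute] = match.group(1).strip()
    return results


if __name__ == "__main__":
    block = "Address: 1 Example Road\nTel: 000-0000\nAbout us"
    print(extract_attributes(block))
```

Lines without a known trigger (such as "About us" above) are ignored. In the approach the abstract describes, an LSTM-based extractor is used alongside the trigger lexicon; that component is not sketched here.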

Original language: English
Pages (from-to): 2057-2073
Number of pages: 17
Journal: Computers, Materials and Continua
Volume: 64
Issue number: 3
State: Published - 30 Jun 2020
Externally published: Yes

Keywords

  • Data extraction
  • Feature classification
  • LSTM
  • Trigger lexicon
  • Web pre-process
