
A block segmentation based approach for web information extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses web information extraction in support of automatic teacher information management. We propose an effective approach based on block segmentation. First, teacher introduction web pages are divided into independent blocks, using HTML tags and punctuation marks as segmentation criteria. Then a conditional random field (CRF) model is employed to label the text. We apply this approach to a teacher web page dataset collected from heterogeneous sources. Experimental results indicate that, for basic info and contact info extraction, our approach achieves accurate results using only word-level features. When value features related to education are extended to the block level, the performance of our system on the complex educational information extraction task improves dramatically.
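The segmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_blocks`, the delimiter set, and the sample page are all assumptions made for illustration, since the abstract does not list the exact tags or punctuation marks used. The CRF labelling step is omitted.

```python
import re
from html import unescape

# Assumed delimiter set: the abstract does not say which punctuation marks
# are used. The ASCII period is left out so e-mail addresses stay intact.
PUNCT_SPLIT = re.compile(r"[。；;！!？?\n]+")
# Any HTML tag is treated as a block boundary (also an assumption).
TAG_SPLIT = re.compile(r"<[^>]+>")


def segment_blocks(html: str) -> list[str]:
    """Split an HTML fragment into text blocks at tags, then at punctuation."""
    # Drop script and style bodies, which carry no visible text.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    blocks = []
    for chunk in TAG_SPLIT.split(html):
        # Decode entities before splitting so "&amp;" is not cut at its ";".
        for piece in PUNCT_SPLIT.split(unescape(chunk)):
            text = " ".join(piece.split())  # normalise whitespace
            if text:
                blocks.append(text)
    return blocks


# Hypothetical teacher page fragment, invented for this example.
sample = "<p>Name: Li Wei; Email: li@example.edu</p><div>PhD, 2005, Harbin</div>"
blocks = segment_blocks(sample)
```

Each resulting block would then be passed to a sequence labeller such as a CRF, with word-level or block-level features attached.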

Original language: English
Title of host publication: Proceedings - 2010 International Conference on Asian Language Processing, IALP 2010
Pages: 154-157
Number of pages: 4
State: Published - 2010
Externally published: Yes
Event: 2010 International Conference on Asian Language Processing, IALP 2010 - Harbin, China
Duration: 28 Dec 2010 to 30 Dec 2010

Publication series

Name: Proceedings - 2010 International Conference on Asian Language Processing, IALP 2010

Conference

Conference: 2010 International Conference on Asian Language Processing, IALP 2010
Country/Territory: China
City: Harbin
Period: 28/12/10 to 30/12/10

Keywords

  • Block segmentation
  • CRF
  • Information extraction
