Abstract
Segmenting web pages into small modules that match users' intuitive sense is an important preprocessing step in mobile device browsing, information retrieval, and data extraction applications. Traditional page segmentation algorithms usually exploit heuristic information from the page content and DOM tree, such as visual cues or tag attributes in the DOM tree, but ignore useful features contained in both the sub-tree structures of DOM trees and the semantic content of pages, which in turn leads to poor performance when segmenting complex web pages. In this paper, we present a novel unsupervised page segmentation algorithm, TPS, that exploits richer features in DOM trees. The algorithm bridges the gap between the DOM structure and the semantic modules, and identifies modules by mining the sub-tree structures of DOM trees. Experimental results on various web pages demonstrate that TPS performs better than the state-of-the-art algorithm VIPS.
| Original language | English |
|---|---|
| Pages (from-to) | 387-394 |
| Number of pages | 8 |
| Journal | Information |
| Volume | 15 |
| Issue number | 1 |
| State | Published - Jan 2012 |
| Externally published | Yes |
Keywords
- DOM tree
- Page segmentation
- Web structure mining
Fingerprint
Dive into the research topics of 'TPS: An unsupervised web page segmentation algorithm based on DOM tree structure mining'. Together they form a unique fingerprint.