iSurfer: A focused web crawler based on incremental learning from positive samples

  • Yunming Ye*
  • Fanyuan Ma
  • Yiming Lu
  • Matthew Chiu
  • Joshua Huang
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • The University of Hong Kong

Research output: Contribution to journal › Article › peer-review

Abstract

This paper presents iSurfer, a focused Web crawling system for information retrieval from the Web. Unlike other focused crawlers, iSurfer learns its page classification model and link prediction model incrementally. It employs an online sample detector to distill new samples from crawled Web pages and uses them to update the learned models online. Other focused crawling systems use classifiers that are built from initial positive and negative samples and cannot learn incrementally, so the performance of these classifiers depends on the topical coverage of the initial samples. However, initial samples with good coverage of the target topics, particularly negative ones, are difficult to find. iSurfer's incremental learning strategy therefore has an advantage: it starts from a few positive samples and gains more complete knowledge of the target topics over time. Our experiments on various topics demonstrate that the incremental learning method can improve the harvest rate when starting from only a few initial samples.
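
The abstract does not specify the models iSurfer uses, so the following is only a minimal sketch of the general idea of incremental learning from positive samples, not the paper's method. It assumes a term-count centroid as the topic model, cosine similarity as the relevance score, and a fixed acceptance threshold standing in for the online sample detector. The names `PositiveOnlyTopicModel`, `crawl_step`, and `accept_threshold` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real text preprocessing."""
    return text.lower().split()


class PositiveOnlyTopicModel:
    """Hypothetical topic model: a running term-count centroid of positive pages."""

    def __init__(self, seed_pages):
        self.centroid = Counter()
        self.n_samples = 0
        for page in seed_pages:
            self.update(page)

    def update(self, page_text):
        """Incrementally fold one new positive sample into the model."""
        self.centroid.update(tokenize(page_text))
        self.n_samples += 1

    def score(self, page_text):
        """Cosine similarity between the page and the centroid, in [0, 1]."""
        page = Counter(tokenize(page_text))
        dot = sum(count * self.centroid[term] for term, count in page.items())
        norm_page = math.sqrt(sum(c * c for c in page.values()))
        norm_centroid = math.sqrt(sum(c * c for c in self.centroid.values()))
        if norm_page == 0 or norm_centroid == 0:
            return 0.0
        return dot / (norm_page * norm_centroid)


def crawl_step(model, crawled_pages, accept_threshold=0.5):
    """Assumed stand-in for an online sample detector.

    Pages scoring at or above the threshold are treated as new positive
    samples and used to update the model immediately.
    """
    accepted = []
    for page in crawled_pages:
        if model.score(page) >= accept_threshold:
            model.update(page)
            accepted.append(page)
    return accepted
```

In this sketch, calling `crawl_step` on each batch of fetched pages lets the model absorb vocabulary that the seed pages lacked, which is the sense in which coverage of the target topic can grow from a few positive samples. No negative samples are needed, at the cost of relying on a hand-chosen threshold.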

Original language: English
Pages (from-to): 122-134
Number of pages: 13
Journal: Lecture Notes in Computer Science
Volume: 3007
State: Published - 2004
Externally published: Yes

Keywords

  • Focused Crawler
  • Incremental Learning
  • Link Prediction
  • Positive Sample Based Learning
  • Web Page Classification
