
A forwarding-based task scheduling algorithm for distributed web crawling over DHTs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Distributed Web crawling (DWC) over distributed hash tables (DHTs) has been proposed to remove the bottlenecks of traditional Web crawling. The core of such a system is its fully distributed task scheduling mechanism, in which the crawlers are treated as peers and the crawlees are treated as resources maintained by the peers. A system model based on the Content Addressable Network (CAN) can further optimize the scheduling mechanism by exploiting the network proximity of the crawlers and the crawlees. In this paper, we propose a new forwarding-based method for achieving load balancing in a CAN-based DWC system. In our simulations, the method not only balances the load among peers but also keeps the distance between peers and resources very short. The shortened peer-resource distance in turn meets the need for short crawler-crawlee latencies.
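
The abstract does not spell out the forwarding rule itself, so the following is only a minimal sketch of the general idea of forwarding-based load balancing in a CAN-like coordinate space, not the algorithm from the paper. It assumes a 2-D unit square split into a fixed grid of zones, one crawler peer per zone, and crawlee coordinates that are already known (for example from some network-proximity measurement). The names `GRID`, `CAPACITY`, `MAX_HOPS`, `Peer`, `owner`, and `schedule` are all hypothetical.

```python
from dataclasses import dataclass, field

GRID = 4        # hypothetical: GRID x GRID partition of the unit square
CAPACITY = 3    # hypothetical soft limit on tasks per peer
MAX_HOPS = 4    # hypothetical bound on how far a task may be forwarded


@dataclass
class Peer:
    """A crawler that owns one zone of the coordinate space."""
    row: int
    col: int
    tasks: list = field(default_factory=list)

    def neighbors(self, peers):
        # Zones that share an edge with this one (no wrap-around, for simplicity).
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        return [peers[(self.row + dr, self.col + dc)]
                for dr, dc in steps
                if (self.row + dr, self.col + dc) in peers]


def owner(peers, x, y):
    # The peer whose zone contains the point (x, y) in [0, 1) x [0, 1).
    return peers[(int(y * GRID), int(x * GRID))]


def schedule(peers, url, x, y):
    """Assign a crawl task to the zone owner of (x, y); while the current
    peer is at capacity, forward the task to its least-loaded neighbor."""
    peer = owner(peers, x, y)
    hops = 0
    while len(peer.tasks) >= CAPACITY and hops < MAX_HOPS:
        candidate = min(peer.neighbors(peers), key=lambda p: len(p.tasks))
        if len(candidate.tasks) >= len(peer.tasks):
            break  # no neighbor is less loaded, so stop forwarding
        peer = candidate
        hops += 1
    # CAPACITY is soft: the task is accepted even if forwarding gave up.
    peer.tasks.append(url)
    return peer, hops


# One peer per grid cell, keyed by (row, col).
peers = {(r, c): Peer(r, c) for r in range(GRID) for c in range(GRID)}
```

Calling `schedule(peers, url, x, y)` repeatedly with the same coordinates fills the owning peer up to `CAPACITY` and then spills further tasks onto adjacent zones. Because a task only moves to an edge-sharing neighbor at each hop, it stays close to its original point, which is the trade-off between load balance and peer-resource distance that the abstract describes.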

Original language: English
Title of host publication: ICPADS '09 - 15th International Conference on Parallel and Distributed Systems
Pages: 854-859
Number of pages: 6
DOIs
State: Published - 2009
Externally published: Yes
Event: 15th International Conference on Parallel and Distributed Systems, ICPADS '09 - Shenzhen, Guangdong, China
Duration: 8 Dec 2009 to 11 Dec 2009

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
ISSN (Print): 1521-9097

Conference

Conference: 15th International Conference on Parallel and Distributed Systems, ICPADS '09
Country/Territory: China
City: Shenzhen, Guangdong
Period: 8/12/09 to 11/12/09

Keywords

  • Content addressable network
  • DHT
  • Distributed web crawling
  • Task scheduling
