An improved random forest approach for detection of hidden Web search interfaces

  • Harbin Institute of Technology Shenzhen
  • The University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Search interface detection is an essential technique for extracting information from the hidden Web. The challenge is that search interface data is represented by high-dimensional, sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to this problem. We extend the random forest algorithm with a weighted feature selection method for building the individual classifiers. With this improved random forest algorithm (IRFA), each classifier is learnt from a weighted subset of the feature space, so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We compare our ensemble approach with other well-known classification algorithms, such as SVM and C4.5. The experimental results show that our method is more effective in detecting search interfaces of the hidden Web.
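The idea of learning each classifier from a weighted subset of the feature space can be illustrated with a small sketch. The abstract does not say how IRFA computes its feature weights or grows its trees, so everything below is an assumption made for illustration: the relevance score (difference in class-conditional feature frequency), the use of single-feature decision stumps in place of full decision trees, the toy binary features, and all function names are hypothetical. Missing values are not handled.

```python
import random
from collections import Counter

def feature_weights(X, y):
    """Assumed relevance score: |P(feature=1 | class 1) - P(feature=1 | class 0)|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    weights = []
    for j in range(len(X[0])):
        p_pos = sum(row[j] for row in pos) / max(len(pos), 1)
        p_neg = sum(row[j] for row in neg) / max(len(neg), 1)
        weights.append(abs(p_pos - p_neg) + 1e-6)  # floor so any feature can still be drawn
    return weights

def weighted_subset(weights, k, rng):
    """Draw k distinct feature indices, each draw proportional to the feature weight."""
    candidates = list(range(len(weights)))
    chosen = []
    for _ in range(min(k, len(candidates))):
        pick = rng.choices(candidates, weights=[weights[j] for j in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen

def stump_predict(stump, row):
    """A stump is (feature index, polarity); polarity 1 predicts class 1 when the feature is set."""
    j, polarity = stump
    return row[j] if polarity == 1 else 1 - row[j]

def train_stump(X, y, features):
    """Pick the (feature, polarity) pair with the fewest errors; a stand-in for a full tree."""
    best = None
    for j in features:
        for polarity in (1, 0):
            errors = sum(stump_predict((j, polarity), row) != label
                         for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, polarity)
    return best[1], best[2]

def train_forest(X, y, n_trees=15, k=2, seed=0):
    """Bagging plus weighted (instead of uniform) feature subspace sampling."""
    rng = random.Random(seed)
    weights = feature_weights(X, y)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample of the rows
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        forest.append(train_stump(Xb, yb, weighted_subset(weights, k, rng)))
    return forest

def predict(forest, row):
    """Majority vote over the ensemble."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data with hypothetical binary form features:
# [has_text_input, has_password_field, has_select, has_image]; label 1 = search form.
X = [[1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 1], [1, 0, 0, 0],
     [0, 1, 1, 0], [0, 1, 0, 1], [0, 1, 1, 1], [0, 1, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = feature_weights(X, y)
forest = train_forest(X, y)
predictions = [predict(forest, row) for row in X]
```

In this sketch the only change relative to a plain random forest is `weighted_subset`: a standard forest would draw the candidate features uniformly, whereas here informative features are more likely to reach each classifier, which matters when most dimensions are sparse and uninformative.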

Original language: English
Title of host publication: Proceedings of the 7th International Conference on Machine Learning and Cybernetics, ICMLC
Pages: 1586-1591
Number of pages: 6
State: Published - 2008
Externally published: Yes
Event: 7th International Conference on Machine Learning and Cybernetics, ICMLC - Kunming, China
Duration: 12 Jul 2008 to 15 Jul 2008

Publication series

Name: Proceedings of the 7th International Conference on Machine Learning and Cybernetics, ICMLC
Volume: 3

Conference

Conference: 7th International Conference on Machine Learning and Cybernetics, ICMLC
Country/Territory: China
City: Kunming
Period: 12/07/08 to 15/07/08

Keywords

  • Form classification
  • Hidden Web
  • Random forest
  • Search interface detection
