
Compensation strategy of unseen feature words in naive Bayes text classification

  • School of Management, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

When applied to text classification, naive Bayes suffers from the unseen feature words problem. This problem is hard to solve by enlarging the corpora, because the corpora themselves are sparse: the distribution of words follows Zipf's law. Inspired by statistical language modeling, a novel approach is proposed that applies smoothing algorithms to naive Bayes for text classification in order to overcome the unseen feature words problem. The experimental corpora come from the National 863 Evaluation on text classification. In the open test with stop words removed, naive Bayes with the Good-Turing algorithm performs 3.05% better than with Laplace smoothing and 1.00% better than with Lidstone smoothing. In the experiment where feature words are extracted by cross entropy, naive Bayes with the Good-Turing algorithm performs 1.95% better than the Maximum Entropy model. Smoothing thus helps to solve the unseen feature words problem caused by sparse data.
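As a rough illustration of the unseen feature words problem and of additive smoothing, the sketch below trains a multinomial naive Bayes classifier with Lidstone smoothing (Laplace smoothing is the special case `lam=1.0`). It is not the paper's implementation and it does not implement Good-Turing; the toy corpus `DOCS` and the function names `train`, `word_prob`, and `classify` are assumptions made up for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (class label, tokens). Not data from the paper.
DOCS = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("finance", ["stock", "market", "win"]),
    ("finance", ["market", "bank"]),
]

def train(docs, lam=1.0):
    """Collect per-class word counts; lam is the Lidstone constant."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {
        "class_docs": class_docs,
        "word_counts": word_counts,
        "vocab": vocab,
        "lam": lam,
        "n_docs": len(docs),
    }

def word_prob(model, word, label):
    """P(word | label) with additive smoothing: (count + lam) / (total + lam * |V|)."""
    counts = model["word_counts"][label]
    total = sum(counts.values())
    lam = model["lam"]
    return (counts[word] + lam) / (total + lam * len(model["vocab"]))

def classify(model, tokens):
    """Return the label with the highest log posterior; requires lam > 0."""
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        score = math.log(n / model["n_docs"])  # log prior
        for w in tokens:
            if w in model["vocab"]:  # words outside the feature set are skipped
                score += math.log(word_prob(model, w, label))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    unsmoothed = train(DOCS, lam=0.0)
    smoothed = train(DOCS, lam=1.0)
    # "stock" is a feature word that never occurs in the "sports" class.
    print(word_prob(unsmoothed, "stock", "sports"))
    print(word_prob(smoothed, "stock", "sports"))
    print(classify(smoothed, ["team", "stock", "goal"]))
```

Without smoothing (`lam=0.0`), a single feature word unseen in a class drives that class's likelihood to zero regardless of the other evidence; any positive `lam` reserves some probability mass for such words. Good-Turing, the method the abstract reports as strongest, instead re-estimates counts from the frequencies of frequencies and is not shown here.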

Original language: English
Pages (from-to): 956-960
Number of pages: 5
Journal: Harbin Gongye Daxue Xuebao/Journal of Harbin Institute of Technology
Volume: 40
Issue number: 6
State: Published - Jun 2008

Keywords

  • Data smoothing
  • Naive Bayes classification
  • Text classification
  • Unseen feature words
