Abstract
When applied to text classification, naive Bayes suffers from the unseen feature words problem. Expanding the corpora hardly solves this problem, because the corpora are inherently sparse and the distribution of words follows Zipf's law. Inspired by statistical language modeling, a novel approach is proposed that applies smoothing algorithms to naive Bayes text classification to overcome the unseen feature words problem. The experimental corpora come from the National 863 Evaluation on text classification. In the open test with stop words removed, the performance of naive Bayes with Good-Turing smoothing is 3.05% higher than with Laplace smoothing and 1.00% higher than with Lidstone smoothing. In the experiment that uses cross entropy to extract feature words, the performance of naive Bayes with Good-Turing smoothing is 1.95% higher than that of the Maximum Entropy model. The results indicate that smoothing algorithms help to solve the unseen feature words problem caused by sparse data.
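To illustrate why smoothing matters here, the following is a minimal sketch, not the paper's implementation. It shows additive (Lidstone) smoothing in a multinomial naive Bayes scorer, where `lam=1.0` gives Laplace smoothing, together with the basic Good-Turing estimate of the probability mass reserved for unseen words (the number of words seen exactly once divided by the total token count). The function names, the toy corpus, and the extra vocabulary slot for out-of-vocabulary words are all assumptions made for illustration.

```python
import math
from collections import Counter


def lidstone_log_prob(word, counts, vocab_size, lam=1.0):
    """Log P(word | class) with additive smoothing (lam=1.0 is Laplace)."""
    total = sum(counts.values())
    # An unseen word has count 0 but still receives probability lam / (...).
    return math.log((counts.get(word, 0) + lam) / (total + lam * vocab_size))


def good_turing_unseen_mass(counts):
    """Basic Good-Turing estimate of the total probability of unseen words: N1 / N."""
    total = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)  # words seen exactly once
    return n1 / total if total else 0.0


def classify(tokens, class_counts, class_priors, vocab_size, lam=1.0):
    """Return the class with the highest smoothed naive Bayes log score."""
    scores = {}
    for label, counts in class_counts.items():
        score = math.log(class_priors[label])
        for tok in tokens:
            score += lidstone_log_prob(tok, counts, vocab_size, lam)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data (not from the paper's corpora).
train = {
    "sports": "ball goal team ball win".split(),
    "finance": "stock bank market stock loss".split(),
}
class_counts = {label: Counter(words) for label, words in train.items()}
class_priors = {label: 1 / len(train) for label in train}
vocab = {w for words in train.values() for w in words}
vocab_size = len(vocab) + 1  # assumption: one extra slot for out-of-vocabulary words

# "merger" never occurs in training; smoothing keeps its probability non-zero.
predicted = classify(["ball", "merger"], class_counts, class_priors, vocab_size)
unseen_mass_sports = good_turing_unseen_mass(class_counts["sports"])
```

Without smoothing (`lam=0`), the unseen word "merger" would have probability zero in every class and the log score would be undefined, which is the failure mode the abstract describes.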
| Original language | English |
|---|---|
| Pages (from-to) | 956-960 |
| Number of pages | 5 |
| Journal | Harbin Gongye Daxue Xuebao/Journal of Harbin Institute of Technology |
| Volume | 40 |
| Issue number | 6 |
| State | Published - Jun 2008 |
Keywords
- Data smoothing
- Naive Bayes classification
- Text classification
- Unseen feature words
Title: Compensation strategy of unseen feature words in naive Bayes text classification