
Chinese new word extraction from MicroBlog data

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Chinese new word extraction is an important task in Chinese natural language processing, and MicroBlog has become a major venue for the creation and dissemination of new words. Although many effective methods have been proposed, little research has addressed Internet texts, especially MicroBlog texts. In this paper, we study a MicroBlog-oriented method for new word extraction. First, we analyze the performance of classical statistical measures in extracting new words from MicroBlog texts. Second, we base our work on Branch Entropy and propose a modified method that addresses the shortcomings of statistical measures and the characteristics of MicroBlog texts. Experimental results demonstrate that our method is feasible and effective. Finally, we show four types of new words extracted from MicroBlog.
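The abstract names Branch Entropy as the basis of the method but does not give the modified formulation. As a rough illustration only, the sketch below computes plain left and right branch entropy for character n-grams: a candidate string whose neighbouring characters vary widely on both sides is more likely to be a free-standing word. The function name `branch_entropy`, the `max_len` parameter, and the toy corpus are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(corpus, max_len=4):
    """Return {candidate: (left_entropy, right_entropy)} for every
    character n-gram of length 1..max_len found in the corpus.

    Generic branch entropy, NOT the paper's modified method.
    """
    left = defaultdict(Counter)   # candidate -> counts of preceding characters
    right = defaultdict(Counter)  # candidate -> counts of following characters

    for text in corpus:
        n = len(text)
        for i in range(n):
            for k in range(1, max_len + 1):
                j = i + k
                if j > n:
                    break
                cand = text[i:j]
                if i > 0:
                    left[cand][text[i - 1]] += 1
                if j < n:
                    right[cand][text[j]] += 1

    def entropy(counter):
        # Shannon entropy (in bits) of the neighbour distribution.
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    candidates = set(left) | set(right)
    return {c: (entropy(left[c]), entropy(right[c])) for c in candidates}


# Toy corpus (hypothetical, for illustration): "给力" appears in varied contexts.
toy_corpus = ["真给力啊", "很给力的", "太给力了"]
scores = branch_entropy(toy_corpus)
```

In this toy corpus, "给力" has three distinct neighbours on each side, so both entropies are high, whereas "给" is always followed by "力", so its right entropy is zero. A common way to use such scores is to keep candidates whose smaller side entropy exceeds a threshold; the threshold and any MicroBlog-specific adjustments would have to come from the paper itself.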

Original language: English
Title of host publication: Proceedings - International Conference on Machine Learning and Cybernetics
Publisher: IEEE Computer Society
Pages: 1874-1879
Number of pages: 6
ISBN (Electronic): 9781479902576
State: Published - 2013
Externally published: Yes
Event: 12th International Conference on Machine Learning and Cybernetics, ICMLC 2013 - Tianjin, China
Duration: 14 Jul 2013 to 17 Jul 2013

Publication series

Name: Proceedings - International Conference on Machine Learning and Cybernetics
Volume: 4
ISSN (Print): 2160-133X
ISSN (Electronic): 2160-1348

Conference

Conference: 12th International Conference on Machine Learning and Cybernetics, ICMLC 2013
Country/Territory: China
City: Tianjin
Period: 14/07/13 to 17/07/13

Keywords

  • Branch entropy
  • MicroBlog
  • Natural language processing
  • New word extraction
  • Statistical measure
