
A unified tagging approach to text normalization

  • Tsinghua University
  • Microsoft USA
  • National University of Singapore

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

This paper addresses the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization, we mean converting 'informally inputted' text into its canonical form by eliminating 'noise' in the text and detecting paragraph and sentence boundaries. Previously, text normalization was often handled in an ad hoc fashion, or its subtasks were studied separately. This paper first gives a formalization of the entire problem. It then proposes a unified tagging approach to perform the task using Conditional Random Fields (CRF). The paper shows that with the introduction of a small set of tags, most of the text normalization tasks can be performed within the approach. The accuracy of the proposed method is high, because the subtasks of normalization are interdependent and should be performed together. Experimental results on email data cleaning show that the proposed method significantly outperforms both the approach of using cascaded models and that of employing independent models.
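
The idea of normalization as tagging can be sketched in a few lines: every token receives one tag from a small set, and the normalized text is obtained by applying each tag's edit operation. The sketch below is illustrative only. The tag names (`KEEP`, `DELETE`, `TO_SPACE`, `LOWER`, `CAPITALIZE`) and the helper `apply_tags` are hypothetical and are not taken from the paper, and the tags are assigned by hand here, whereas in the approach the abstract describes a CRF would predict the tag sequence.

```python
from typing import List, Tuple

# Hypothetical tag set for illustration; NOT the tag inventory used in the paper.
KEEP = "KEEP"              # leave the token unchanged
DELETE = "DELETE"          # drop the token (e.g. redundant punctuation)
TO_SPACE = "TO_SPACE"      # replace the token with one space (e.g. a spurious line break)
LOWER = "LOWER"            # lowercase the token
CAPITALIZE = "CAPITALIZE"  # uppercase the first letter, lowercase the rest


def apply_tags(tagged: List[Tuple[str, str]]) -> str:
    """Decode a (token, tag) sequence into normalized text."""
    out = []
    for token, tag in tagged:
        if tag == KEEP:
            out.append(token)
        elif tag == DELETE:
            continue
        elif tag == TO_SPACE:
            out.append(" ")
        elif tag == LOWER:
            out.append(token.lower())
        elif tag == CAPITALIZE:
            out.append(token.capitalize())
        else:
            raise ValueError(f"unknown tag: {tag}")
    return "".join(out)


# Hand-assigned tags; a trained sequence model would predict these jointly.
example = [
    ("HELLO", CAPITALIZE),
    (" ", KEEP),
    ("there", KEEP),
    ("\n", TO_SPACE),   # line break inside a sentence
    ("how", KEEP),
    (" ", KEEP),
    ("are", KEEP),
    (" ", KEEP),
    ("you", KEEP),
    ("?", KEEP),
    ("?", DELETE),      # redundant punctuation
    ("?", DELETE),
]

print(apply_tags(example))  # Hello there how are you?
```

Because one tag sequence covers case restoration, line-break handling, and punctuation cleanup at once, a sequence model can in principle exploit dependencies between these decisions, which is the motivation the abstract gives for treating the subtasks together.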

Original language: English
Title of host publication: ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
Pages: 688-695
Number of pages: 8
State: Published - 2007
Event: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007 - Prague, Czech Republic
Duration: 23 Jun 2007 to 30 Jun 2007

Publication series

Name: ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics

Conference

Conference: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007
Country/Territory: Czech Republic
City: Prague
Period: 23/06/07 to 30/06/07
