
Segmentation of mixed Chinese/English document including scattered italic characters

  • Yong Xia*
  • Chun Heng Wang
  • Ru Wei Dai

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Mixed Chinese/English documents are difficult to segment when many italic characters are scattered through them. Most prior work focuses on English documents, but mixed documents differ from English ones and have special features that must be taken into account. This paper proposes a new approach to the problem. First, an appropriate character area is chosen for italic detection. Next, a two-step strategy is applied: italic determination is performed first, and the slant angle is estimated only if the character pattern is identified as italic. Finally, the italic character pattern is corrected by a shear transform. A method for italic determination based on a two-step weighted projection profile histogram is introduced, together with a fast algorithm for estimating the slant angle. Three large sample collections, consisting of characters, character pairs, and documents respectively, are used to evaluate the method, and encouraging results are achieved.
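
The pipeline outlined in the abstract (italic determination, then slant estimation, then shear correction) can be illustrated with a minimal Python sketch. The abstract does not give the details of the two-step weighted projection profile histogram or of the fast slant estimation algorithm, so this sketch substitutes a plain unweighted vertical projection profile and a brute-force search over candidate angles. All function names, the nominal test angle, the threshold ratio, and the angle range are hypothetical choices made for illustration only.

```python
import numpy as np


def shear(img, angle_deg):
    """Shear a binary image horizontally; the bottom row stays fixed and
    higher rows shift right (positive angle) or left (negative angle)."""
    h, w = img.shape
    t = np.tan(np.radians(angle_deg))
    pad = int(np.ceil(abs(t) * (h - 1)))  # room for the largest row shift
    out = np.zeros((h, w + 2 * pad), dtype=img.dtype)
    for y in range(h):
        dx = int(round(t * (h - 1 - y)))  # shift grows with height above the bottom row
        out[y, pad + dx:pad + dx + w] = img[y]
    return out


def profile_score(img):
    """Sharpness of the vertical projection profile: sum of squared column sums.
    Upright strokes concentrate ink in few columns and score higher."""
    cols = img.sum(axis=0).astype(float)
    return float((cols ** 2).sum())


def is_italic(img, nominal_deg=12.0, ratio=1.2):
    """Step 1 (assumed criterion): call the pattern italic if de-shearing by a
    nominal italic angle sharpens the projection profile by a clear margin."""
    return profile_score(shear(img, -nominal_deg)) > ratio * profile_score(img)


def estimate_slant(img, candidates=np.arange(1.0, 31.0, 1.0)):
    """Step 2 (brute force, not the paper's fast algorithm): pick the angle
    whose de-sheared image has the sharpest projection profile."""
    scores = [profile_score(shear(img, -a)) for a in candidates]
    return float(candidates[int(np.argmax(scores))])


def correct_italic(img):
    """Run determination first; estimate and undo the slant only for italics."""
    if not is_italic(img):
        return img, 0.0
    theta = estimate_slant(img)
    return shear(img, -theta), theta


# Synthetic example: two vertical strokes, slanted by 15 degrees.
upright = np.zeros((20, 24), dtype=np.uint8)
upright[:, 4] = 1
upright[:, 16] = 1
italic = shear(upright, 15.0)
fixed, theta = correct_italic(italic)
```

Running the determination step before the angle search means upright patterns skip the more expensive estimation entirely, which matches the two-step ordering described in the abstract. Real Chinese characters contain many non-vertical strokes, so a projection criterion this simple would likely need the weighting and character-area selection that the paper describes.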

Original language: English
Title of host publication: Computer Processing of Oriental Languages - Beyond the Orient
Subtitle of host publication: The Research Challenges Ahead - 21st International Conference, ICCPOL 2006, Proceedings
Pages: 13-21
Number of pages: 9
State: Published - 2006
Externally published: Yes
Event: 21st International Conference on Computer Processing of Oriental Languages: Beyond the Orient: The Research Challenges Ahead, ICCPOL 2006 - Singapore, Singapore
Duration: 17 Dec 2006 to 19 Dec 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4285 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Computer Processing of Oriental Languages: Beyond the Orient: The Research Challenges Ahead, ICCPOL 2006
Country/Territory: Singapore
City: Singapore
Period: 17/12/06 to 19/12/06
