A review of recent advances in visual speech decoding

  • Ziheng Zhou*
  • Guoying Zhao
  • Xiaopeng Hong
  • Matti Pietikäinen

*Corresponding author for this work

University of Oulu

Research output: Contribution to journal › Review article › peer-review

Abstract

Visual speech information plays an important role in automatic speech recognition (ASR), especially when the audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains wide open. This paper provides a detailed review of recent advances in this research area. In contrast to the previous survey [97], which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, three questions relate to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question concerns audio-visual speech fusion, considering the dynamic changes in modality reliability encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced. These advanced techniques can be used to improve region-of-interest detection, but they have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.

Original language: English
Pages (from-to): 590-605
Number of pages: 16
Journal: Image and Vision Computing
Volume: 32
Issue number: 9
DOIs
State: Published - Sep 2014
Externally published: Yes

Keywords

  • Audio-visual speech recognition
  • Automatic speech recognition
  • Lip-reading
  • Review
  • Visual speech decoding
