Audio-Semantic Enhanced Pose-Driven Talking Head Generation

  • Meng Liu
  • Da Li
  • Yongqiang Li
  • Xuemeng Song
  • Liqiang Nie*
  • *Corresponding author for this work
  • Shandong Jianzhu University
  • Shandong University
  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Talking head generation, aiming to create photo-realistic videos from a single reference image and audio input, has emerged as a vibrant area of interest within the computer vision community. Despite notable advancements, several challenges remain unaddressed. For instance, many existing approaches overlook the nuanced relationship between audio semantics and head movement, such as nodding in agreement during affirmative expressions. Additionally, the visual quality of generated content, particularly in depicting teeth, often falls short of achieving authentic realism. To address these limitations, we introduce a groundbreaking audio-semantic enhanced pose-driven talking head generation method. Our approach encompasses a multimodal 3DMM parameter prediction network alongside a high-fidelity video synthesis network, meticulously designed to produce authentic and high-quality talking head videos. The multimodal 3DMM parameter prediction network harnesses both acoustic and audio-deduced semantic information, facilitating accurate head pose predictions that resonate with the semantics of spoken words. Furthermore, to significantly improve the depiction of the mouth area, especially the teeth, our video synthesis stage incorporates a mouth-enhanced network augmented by both local and global discriminators. Comprehensive evaluations across diverse metrics affirm the superiority of our method.

Original language: English
Pages (from-to): 11056-11069
Number of pages: 14
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 11
State: Published - 2024
Externally published: Yes

Keywords

  • One-shot
  • head pose
  • talking head generation
  • word semantics
