
ProsodyTalker: 3D Visual Speech Animation via Prosody Decomposition

  • Harbin Institute of Technology

Research output: Contribution to journal › Conference article › peer-review

Abstract

Most existing 3D visual speech animation methods synthesize lip movements synchronized with speech but neglect head poses, which degrades the realism of the animation. Animating head poses presents two primary challenges: (1) the intricate mapping between speech and head poses remains poorly understood, and (2) 4D face datasets featuring realistic head poses are lacking. Inspired by prosody decomposition in speech processing, we observe that head movements correlate with the fundamental frequency (F0) of speech prosody, while lip movements align with the language content. These observations motivate us to propose a novel framework, dubbed ProsodyTalker, that concurrently synthesizes lip and head movements based on prosody decomposition. The core idea is first to adopt information perturbation to explicitly decompose the speech prosody into pose-related F0 and lip-related language content. Then, an autoregressive content-oriented fusion decoder is employed to enhance lip synchronization in the synthesized facial sequences. To synthesize head poses, we design a transformer-based variational autoencoder to learn a latent distribution of facial sequences and propose an F0-conditioned latent diffusion model to establish a probabilistic mapping from F0 to pose-related latent codes. Furthermore, we contribute a large-scale 4D face dataset containing a wide range of variations in identities, head poses, and facial motions. Extensive experiments show that our method achieves more realistic animation than state-of-the-art methods.
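
For readers unfamiliar with F0, the following is a minimal sketch of estimating the fundamental frequency of a single audio frame by picking the peak of its autocorrelation. It is illustrative only: the abstract does not state how ProsodyTalker extracts F0, and the function name `estimate_f0`, the search range (60 to 400 Hz), and the 0.3 voicing threshold are assumptions made for this example rather than details of the paper.

```python
import math


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 in Hz for one frame via the autocorrelation peak.

    Hypothetical helper for illustration; returns 0.0 for frames that
    look unvoiced (silent, too short, or weakly periodic).
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)

    # Remove the DC offset so it does not dominate the correlation.
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    energy = sum(s * s for s in x)
    if energy == 0.0 or lag_max <= lag_min:
        return 0.0

    # Search the lag range corresponding to plausible speech pitch.
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr

    # Assumed voicing threshold: require a reasonably strong peak.
    if best_lag == 0 or best_corr / energy < 0.3:
        return 0.0
    return sample_rate / best_lag


# Usage: a 40 ms frame of a synthetic 200 Hz tone sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 200.0 * n / 16000) for n in range(640)]
f0 = estimate_f0(frame, 16000)
```

Applying such an estimator frame by frame yields an F0 contour, which is the kind of signal the abstract describes as the conditioning input for head-pose synthesis. Real speech usually calls for a more robust pitch tracker than this sketch.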

Original language: English
Pages (from-to): 5110-5118
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 5
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 to 4 Mar 2025
