
Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion

  • Haoquan Yang*
  • Liqun Deng
  • Yu Ting Yeung
  • Nianzu Zheng
  • Yong Xu

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

This paper tackles the challenge of “live” one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming manner while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage training strategy. Specifically, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities. Experimental results demonstrate that the proposed method achieves performance comparable to offline VC solutions in speech naturalness, intelligibility, and speaker similarity, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.
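The abstract does not spell out the disentanglement objective, but the “mutual information” keyword suggests an MI-based regularizer between representation streams. As a minimal illustrative sketch only (not the authors' implementation), one common choice is a CLUB-style variational upper bound on the mutual information between content and speaker embeddings; the module name, dimensions, and the use of CLUB below are assumptions for demonstration.

```python
# Illustrative sketch, NOT the paper's code: penalize speaker information
# leaking into content representations by minimizing a CLUB-style variational
# upper bound on mutual information. Dimensions and names are assumptions.
import torch
import torch.nn as nn


class MIUpperBound(nn.Module):
    """CLUB-style MI upper-bound estimator between content (x) and speaker (y) embeddings."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        # Variational network q(y | x) modeled as a diagonal Gaussian.
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu(x), self.logvar(x)
        # Log-likelihood term for paired (positive) samples under q(y | x).
        pos = -((y - mu) ** 2) / logvar.exp()
        # Shuffled (negative) pairs approximate the product of marginals.
        y_shuffled = y[torch.randperm(y.size(0))]
        neg = -((y_shuffled - mu) ** 2) / logvar.exp()
        # Sampled CLUB upper bound: E[log q(y|x)] - E[log q(y'|x)]
        # (the log-variance terms cancel because they depend only on x).
        return (pos.mean() - neg.mean()) / 2.0


if __name__ == "__main__":
    # Toy usage: 16 frame-level content vectors (dim 192) and their
    # utterance-level speaker embeddings (dim 256).
    content = torch.randn(16, 192)
    speaker = torch.randn(16, 256)
    mi_estimator = MIUpperBound(x_dim=192, y_dim=256)
    mi_loss = mi_estimator(content, speaker)
    print(float(mi_loss))  # added to the VC training loss with a small weight
```

In practice such an estimator is trained in alternation: the variational network maximizes the likelihood of the positive pairs, while the main encoders minimize the resulting bound. Whether the paper uses this particular estimator is not stated in the abstract.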

Original language: English
Pages (from-to): 2578-2582
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
DOIs
State: Published - 2022
Externally published: Yes
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 18 Sep 2022 – 22 Sep 2022

Keywords

  • mutual information
  • one-shot voice conversion
  • streaming inference
  • subword
