
Frame selection in Si-DNN phonetic space with Wavenet vocoder for voice conversion without parallel training data

  • Feng Long Xie*
  • Frank K. Soong
  • Xi Wang
  • Lei He
  • Haifeng Li
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Microsoft USA
  • Microsoft Cloud and AI

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In this paper, we propose a frame selection approach to voice conversion based on a speaker-independent deep neural network (SI-DNN) and Kullback-Leibler divergence (KLD). The acoustic difference between the source and target speakers is equalized with the SI-DNN in the ASR senone phonetic space. KLD is used as an ideal distortion measure to select the corresponding target frame for each source frame. The acoustic trajectory of the selected frames is rendered with a maximum probability trajectory generation algorithm. A WaveNet-based vocoder is applied to the converted acoustic trajectory to obtain the final speech waveform. From the subjective results we find that 1) the proposed method achieves better performance than the phonetic cluster based selection method [16]; 2) applying the WaveNet vocoder significantly improves naturalness and speaker similarity compared with a linear predictive coding (LPC) based vocoder; 3) a WaveNet vocoder trained only with spectral features, i.e., line spectrum pairs (LSP), better maintains the target speaker's pitch pattern than a WaveNet vocoder trained with both spectral features (LSP) and prosodic features (F0 and unvoiced/voiced flag).
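The frame selection step described in the abstract can be illustrated with a minimal sketch. It assumes each frame is already represented by a senone posterior vector (as an SI-DNN would output) and picks, for every source frame, the target frame with the smallest KLD. The function name `kld_frame_selection`, the toy posteriors, the smoothing constant `eps`, and the direction of the divergence (source relative to target) are all assumptions made for illustration; the abstract does not state which form of KLD the authors use, and the trajectory generation and WaveNet vocoder stages are not shown.

```python
import numpy as np


def kld_frame_selection(src_post, tgt_post, eps=1e-10):
    """Pick, for each source frame, the target frame with minimal KLD.

    src_post: array (n_src, n_senones) of source-frame posteriors.
    tgt_post: array (n_tgt, n_senones) of target-frame posteriors.
    Returns (indices, kld) where indices has shape (n_src,) and
    kld has shape (n_src, n_tgt).
    """
    # Floor and renormalise so that the logarithms are finite (assumption).
    p = np.clip(np.asarray(src_post, dtype=float), eps, None)
    q = np.clip(np.asarray(tgt_post, dtype=float), eps, None)
    p /= p.sum(axis=1, keepdims=True)
    q /= q.sum(axis=1, keepdims=True)

    # KL(p_i || q_j) = sum_k p_ik * log(p_ik) - sum_k p_ik * log(q_jk)
    neg_entropy = np.sum(p * np.log(p), axis=1, keepdims=True)  # (n_src, 1)
    cross_term = p @ np.log(q).T                                # (n_src, n_tgt)
    kld = neg_entropy - cross_term

    return np.argmin(kld, axis=1), kld


# Toy posteriors over three hypothetical senones.
source = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10]])
target = np.array([[0.10, 0.10, 0.80],
                   [0.85, 0.10, 0.05],
                   [0.05, 0.90, 0.05]])

selected, kld = kld_frame_selection(source, target)
```

In a full system, the indices in `selected` would be used to look up the acoustic features (for example LSP vectors) of the chosen target frames before trajectory generation.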

Original language: English
Title of host publication: 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 56-60
Number of pages: 5
ISBN (Electronic): 9781538656273
DOIs
State: Published - 2 Jul 2018
Event: 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Taipei, Taiwan, Province of China
Duration: 26 Nov 2018 to 29 Nov 2018

Publication series

Name: 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings

Conference

Conference: 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018
Country/Territory: Taiwan, Province of China
City: Taipei
Period: 26/11/18 to 29/11/18

Keywords

  • Deep neural network
  • Kullback-Leibler divergence
  • Voice conversion
  • WaveNet vocoder
