
MS-VBRVQ: Multi-scale variable bitrate speech residual vector quantization

  • Yukun Qian
  • Shiyun Xu
  • Xuyi Zhuang
  • Zehua Zhang
  • Mingjiang Wang*
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Article › peer-review

Abstract

Recent speech quantization compression models have adopted residual vector quantization (RVQ) methods. However, these models typically use fixed bitrates, allocating the same number of time frames at a constant scale across all speech segments. This approach may lead to bitrate inefficiency, particularly when the audio contains simpler segments. To address this limitation, we introduce a multi-scale variable bitrate approach by incorporating a relative importance map, adaptive threshold masks, and a gradient estimation function into the RVQ-GAN model. This method allows the allocation of time frames at varying time scales, depending on the complexity of the audio: more time frames are allocated to complex segments and fewer to simpler ones. Additionally, we propose both symmetric and asymmetric decoding methods. Asymmetric decoding is easier to implement and integrates seamlessly into the system, while symmetric decoding delivers superior audio quality at lower bitrates. Subjective and objective experiments demonstrate that, compared to EnCodec, both of our decoding methods deliver excellent audio quality at lower bitrates across various speech and singing datasets, with only a slight increase in computational cost. Compared to the VRVQ method, we achieve comparable audio quality at even lower bitrates and at a lower computational cost.
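
The idea of letting an importance map drive the bitrate can be illustrated with a small sketch. This is not the authors' multi-scale scheme: it is a simplified, per-frame illustration in which a hypothetical importance score in [0, 1] is thresholded to decide how many RVQ codebooks each frame keeps, and the resulting average bitrate is compared with a fixed-rate baseline. The function names, the thresholding rule, the frame rate (75 Hz), and the codebook size (1024) are all assumptions chosen for the example; the gradient estimation step needed for training is omitted.

```python
import math

def codebook_mask(importance, num_codebooks):
    """Hypothetical rule: frame t keeps codebook k when importance[t] * num_codebooks > k."""
    return [
        [1 if p * num_codebooks > k else 0 for k in range(num_codebooks)]
        for p in importance
    ]

def average_bitrate(mask, frame_rate_hz, codebook_size):
    """Bits per second = frame rate * mean active codebooks per frame * bits per code."""
    bits_per_code = math.log2(codebook_size)
    mean_active = sum(sum(row) for row in mask) / len(mask)
    return frame_rate_hz * mean_active * bits_per_code

# Assumed toy importance scores: low for simple frames, high for complex ones.
importance = [0.1, 0.9, 0.5, 0.3]
mask = codebook_mask(importance, num_codebooks=8)

# Fixed-rate baseline: every frame uses all 8 codebooks.
fixed_mask = [[1] * 8 for _ in importance]

variable_bps = average_bitrate(mask, frame_rate_hz=75, codebook_size=1024)
fixed_bps = average_bitrate(fixed_mask, frame_rate_hz=75, codebook_size=1024)
print(f"variable: {variable_bps:.0f} bps, fixed: {fixed_bps:.0f} bps")
```

Under these assumed values, the simple frames keep fewer codebooks than the complex ones, so the average bitrate falls below the fixed-rate baseline.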

Original language: English
Article number: 103346
Journal: Speech Communication
Volume: 177
State: Published - Feb 2026
Externally published: Yes

Keywords

  • Importance map
  • Multi-scale
  • RVQ
  • Variable bitrate
