Extending StarGAN-VC to Unseen Speakers Using RawNet3 Speaker Representation

Bogyung Park; Somin Park; Hyunki Hong

KIPS Transactions on Software and Data Engineering, Vol. 12, No. 7, pp. 303-314, Jul. 2023

DOI: https://doi.org/10.3745/KTSDE.2023.12.7.303

Keywords: Voice Conversion, Speaker Attribute, Generalization, StarGAN-VC, RawNet3
Abstract

Voice conversion, a technology that regenerates one person's speech with the acoustic properties (tone, cadence, gender) of another speaker, has many applications in education, communication, and entertainment. This paper proposes an approach based on the StarGAN-VC model that generates realistic-sounding speech without requiring parallel utterances. The existing StarGAN-VC model represents the source and target speakers as one-hot vectors, which restricts it to the speakers seen during training. To overcome this constraint, this paper instead extracts feature vectors of target speakers using a pre-trained RawNet3 model. This yields a latent space in which voice conversion can be performed without direct speaker-to-speaker mappings, enabling an any-to-any structure. In addition to the loss terms of the original StarGAN-VC model, a Wasserstein distance loss term is used to ensure that generated voice segments match the acoustic properties of the target voice. The Two Time-Scale Update Rule (TTUR) is also applied to stabilize training. Experimental results show that the proposed method outperforms previous methods, including the StarGAN-VC network on which it is based.
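The three ingredients named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that the speaker vector is appended to every acoustic frame (the channel-wise concatenation commonly used for StarGAN-style conditioning), that the Wasserstein term is a 1-D empirical Wasserstein-1 distance between feature samples, and that TTUR means separate generator and discriminator learning rates. All function names and the learning-rate values are hypothetical.

```python
import math
from typing import List, Sequence

# Hypothetical sketch: names, shapes, and learning rates are illustrative
# assumptions, not values taken from the paper.

def one_hot(index: int, num_speakers: int) -> List[float]:
    """StarGAN-VC-style speaker code: its size is fixed to the training speakers."""
    return [1.0 if i == index else 0.0 for i in range(num_speakers)]

def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a speaker embedding to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def condition_frames(frames: List[List[float]], code: Sequence[float]) -> List[List[float]]:
    """Append the same speaker code to every acoustic frame (channel-wise concat)."""
    return [list(frame) + list(code) for frame in frames]

def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    With equal sample sizes the optimal transport plan pairs the sorted values,
    so W1 is the mean absolute difference between the sorted samples.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# TTUR: the discriminator is updated with a larger learning rate than the
# generator. These particular values are assumptions for illustration.
TTUR_LEARNING_RATES = {"generator": 1e-4, "discriminator": 4e-4}

if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4]]   # toy input: 2 frames x 2 acoustic features
    emb = l2_normalize([3.0, 4.0])      # stand-in for a RawNet3 embedding
    print(condition_frames(frames, emb)[0])  # [0.1, 0.2, 0.6, 0.8]
    print(wasserstein_1d([0.0, 1.0, 2.0], [2.0, 3.0, 1.0]))  # 1.0
```

The design point is that `one_hot` can only index speakers that existed at training time, whereas an embedding produced by a pre-trained speaker encoder can be computed from any utterance, so `condition_frames` works unchanged for unseen speakers.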

Cite this article

[IEEE Style]

B. Park, S. Park, H. Hong, "Extending StarGAN-VC to Unseen Speakers Using RawNet3 Speaker Representation," KIPS Transactions on Software and Data Engineering, vol. 12, no. 7, pp. 303-314, 2023. DOI: https://doi.org/10.3745/KTSDE.2023.12.7.303.

[ACM Style]

Bogyung Park, Somin Park, and Hyunki Hong. 2023. Extending StarGAN-VC to Unseen Speakers Using RawNet3 Speaker Representation. KIPS Transactions on Software and Data Engineering, 12, 7, (2023), 303-314. DOI: https://doi.org/10.3745/KTSDE.2023.12.7.303.