Comparison of Voice Cloning Algorithms in Zero-shot and Few-shot Scenarios
https://doi.org/10.15514/ISPRAS-2024-36(4)-1
Abstract
Voice cloning technology has made significant strides in recent years, with applications ranging from personalized virtual assistants to sophisticated entertainment systems. This study compares nine voice cloning models, covering both zero-shot and fine-tuned approaches. Zero-shot voice cloning models have gained attention for their ability to generate high-quality synthetic voices without requiring extensive training data for each new voice, and for their capacity for real-time online inference. In contrast, non-zero-shot models typically require additional data but can offer improved fidelity in voice reproduction. The study comprises two key experiments. The first evaluates the performance of zero-shot voice cloning models, analyzing their ability to accurately reproduce target voices without prior exposure. The second involves fine-tuning the models on target speakers to assess improvements in voice quality and adaptability. The models are evaluated on key metrics covering voice quality, speaker identity preservation, and both subjective and objective performance measures. The findings indicate that while zero-shot models offer greater flexibility and ease of deployment, fine-tuned models can deliver superior performance.
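To make the objective side of the evaluation concrete, below is a minimal Python sketch of mel-cepstral distortion (MCD), a common objective measure of how closely a cloned voice matches its target (cf. Kubichek [17]). It assumes the librosa library is available; the file names, the truncation-based frame alignment, and the 13-coefficient MFCC approximation of the mel-cepstrum are illustrative assumptions, not the exact evaluation pipeline used in the paper.

import numpy as np
import librosa

def mel_cepstral_distortion(ref_path: str, syn_path: str,
                            sr: int = 22050, n_mfcc: int = 13) -> float:
    """Rough MCD (dB) between a reference and a synthesized utterance.

    Frames are aligned by simple truncation; published MCD figures
    usually align frames with dynamic time warping first, so treat
    this as a sketch rather than a reference implementation.
    """
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)
    # MFCCs stand in for the mel-cepstra of the original measure.
    mc_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    mc_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)
    n = min(mc_ref.shape[1], mc_syn.shape[1])
    # Drop the 0th (energy) coefficient, as is conventional for MCD.
    diff = mc_ref[1:, :n] - mc_syn[1:, :n]
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)  # standard MCD scaling constant
    return float(const * np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))

# Example usage (file paths are hypothetical):
# print(mel_cepstral_distortion("target_speaker.wav", "cloned_output.wav"))

Lower MCD indicates a closer spectral match between the cloned and target voices; speaker identity preservation is typically measured separately, e.g. as the cosine similarity of speaker-verification embeddings [16].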
About the Authors
Olga HOVHANNISYAN
Armenia
Researcher at the Center of Advanced Software Technologies (CAST) and a Ph.D. student at Russian-Armenian University, specializing in mathematical and software support for computing systems. She holds a B.Sc. in Informatics and Applied Mathematics from the Russian-Armenian University, Armenia (2021) and an M.Sc. in Intellectual Systems and Robotics from Russian-Armenian University (2023). Her research centers on speech processing and voice synthesis, including voice cloning, as well as advances in machine learning.
David SARGSYAN
Armenia
Bachelor’s student in Applied Mathematics and Informatics at the Russian-Armenian University. He is also a researcher at the Center of Advanced Software Technologies (CAST). His research interests include speech recognition, text-to-speech technologies, and large language models (LLMs).
Artur MALAJYAN
Armenia
Received his Bachelor’s degree in Informatics and Applied Mathematics from the Russian-Armenian University in 2020. In 2022, he earned a Master’s degree in Machine Learning from the Russian-Armenian University, Armenia. He is currently a researcher at the Center of Advanced Software Technologies (CAST). His research interests include natural language processing (NLP) and voice technologies.
References
1. Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber. XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model. arXiv:2406.04904, 2024.
2. Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691, 2023.
3. Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. arXiv:2112.02418, 2023.
4. Zengyi Qin, Wenliang Zhao, Xumin Yu, Xin Sun. OpenVoice: Versatile Instant Voice Cloning. arXiv:2312.01479, 2024.
5. Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath. VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild. arXiv:2403.16973, 2024.
6. Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei. Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling. arXiv:2303.03926, 2023.
7. Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao. NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. arXiv:2403.03100, 2024.
8. Jaehyeon Kim, Jungil Kong, Juhee Son. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. arXiv:2106.06103, 2021.
9. Retrieval-based Voice Cloning. Available at: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
10. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499, 2016.
11. Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi. Deep Voice: Real-time Neural Text-to-Speech. arXiv:1702.07825, 2017.
12. Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. arXiv:1806.04558, 2018.
13. James Betker. Better speech synthesis through scaling. arXiv:2305.07243, 2023.
14. Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous. Tacotron: Towards End-to-End Speech Synthesis. arXiv:1703.10135, 2017.
15. Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129, 2020.
16. Li Wan, Quan Wang, Alan Papir, Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 4879–4883.
17. Robert F. Kubichek. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1, pp. 125–128, 1993.
18. Jacob Benesty, Jingdong Chen, Yiteng Huang, Israel Cohen. Pearson correlation coefficient. In Noise Reduction in Speech Processing, Springer, pp. 1–4, 2009.
19. Daniel Griffin, Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
20. Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit [sound]. University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017. https://doi.org/10.7488/ds/1994.
For citations:
HOVHANNISYAN O., SARGSYAN D., MALAJYAN A. Comparison of Voice Cloning Algorithms in Zero-shot and Few-shot Scenarios. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(4):7-16. https://doi.org/10.15514/ISPRAS-2024-36(4)-1