Enhancing Emotional Expressiveness in Voice Conversion Using Seq2Seq and CycleGAN
DOI:
https://doi.org/10.71426/jcdt.v1.i2.pp98-103

Keywords:
Emotional voice conversion, Seq2Seq modeling, CycleGAN, Non-parallel speech transformation, Mel-spectrograms, Speaker identity preservation.

Abstract
Emotional voice conversion (EVC) aims to transform the emotional characteristics of speech while preserving speaker identity and linguistic content, and plays a critical role in affective computing, expressive speech synthesis, and human--computer interaction. Despite recent progress, existing EVC approaches often struggle to jointly model long-term temporal dependencies and achieve perceptually realistic spectral transformations, particularly in non-parallel training scenarios. To address these challenges, this paper proposes a high-fidelity emotional voice conversion framework that integrates Sequence-to-Sequence (Seq2Seq) temporal modeling with Cycle-Consistent Generative Adversarial Networks (CycleGAN). The proposed architecture operates entirely in the Mel-spectrogram domain. A Seq2Seq encoder--decoder with attention is first employed to capture long-range temporal dependencies and generate a coarse emotion-aware spectral representation. Subsequently, a CycleGAN-based refinement module enhances spectral realism and emotional expressiveness through adversarial and cycle-consistent learning, without requiring parallel emotional speech data. Finally, a neural vocoder reconstructs the time-domain waveform from the refined Mel-spectrogram. The proposed framework is evaluated on the Emotional Speech Dataset (ESD) using objective metrics including Mel-Cepstral Distortion (MCD), fundamental frequency (F0) root-mean-square error (RMSE), and the structural similarity index (SSIM), along with subjective listening evaluations. Experimental results demonstrate that the proposed Seq2Seq--CycleGAN model outperforms conventional Seq2Seq-only and CycleGAN-only baselines in terms of emotional expressiveness, speech naturalness, and speaker similarity, confirming the effectiveness of jointly leveraging temporal modeling and adversarial spectral refinement for high-quality emotional voice conversion.
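The objective metrics named above follow standard definitions. As a minimal illustrative sketch (function names and the list-based input format are assumptions, not the authors' implementation), MCD is the frame-averaged Euclidean distance between mel-cepstral coefficient vectors scaled to decibels, conventionally excluding the 0th (energy) coefficient, and F0 RMSE is computed only over frames that are voiced in both the reference and converted contours:

```python
import math

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    # Frame-averaged MCD in dB between two time-aligned mel-cepstral
    # sequences (lists of per-frame coefficient lists); the 0th
    # (energy) coefficient is excluded by convention.
    total = 0.0
    for ref, conv in zip(mcep_ref, mcep_conv):
        sq = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += (10.0 / math.log(10)) * math.sqrt(2.0 * sq)
    return total / len(mcep_ref)

def f0_rmse(f0_ref, f0_conv):
    # RMSE (Hz) over frames where both contours are voiced (F0 > 0);
    # unvoiced frames carry no pitch and are skipped.
    pairs = [(r, c) for r, c in zip(f0_ref, f0_conv) if r > 0 and c > 0]
    return math.sqrt(sum((r - c) ** 2 for r, c in pairs) / len(pairs))
```

Lower values of both metrics indicate the converted speech is closer to the reference; in practice the sequences would first be time-aligned (e.g., by dynamic time warping) before these distances are computed.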
License
Copyright (c) 2025 Ebrahim E. Elsayed, Abdullayev Vugar, Hayal Mohammed R., Didi Faouzi (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.