Enhancing Emotional Expressiveness in Voice Conversion Using Seq2Seq and CycleGAN

Authors

  • Didi Faouzi, Department of Common Core in Technology, Laboratory of Physics of Experimental Techniques and its Applications, University Yahia Fares of Medea, Medea, 26000, Algeria. Email: didifouzi19@gmail.com; didi.faouzi@univ-medea.dz
  • Abdullayev Vugar, Department of Computer Engineering, Azerbaijan State Oil and Industry University, AZ1010, Azerbaijan. Email: abdulvugar@mail.ru; vuqar.abdullayev@asoiu.edu.az
  • Hayal Mohammed R., Department of Electronics and Communications Engineering, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt. Email: mohammedraisan@gmail.com; mohammedraisan@std.mans.edu.eg
  • Ebrahim E. Elsayed, Department of Electronics and Communications Engineering, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt. Email: engebrahem16@gmail.com; engebrahem16@std.mans.edu.eg

DOI:

https://doi.org/10.71426/jcdt.v1.i2.pp98-103

Keywords:

Emotional voice conversion, Seq2Seq modeling, CycleGAN, Non-parallel speech transformation, Mel-spectrograms, Speaker identity preservation.

Abstract

Emotional voice conversion (EVC) aims to transform the emotional characteristics of speech while preserving speaker identity and linguistic content, and plays a critical role in affective computing, expressive speech synthesis, and human-computer interaction. Despite recent progress, existing EVC approaches often struggle to jointly model long-term temporal dependencies and achieve perceptually realistic spectral transformations, particularly in non-parallel training scenarios. To address these challenges, this paper proposes a high-fidelity emotional voice conversion framework that integrates Sequence-to-Sequence (Seq2Seq) temporal modeling with Cycle-Consistent Generative Adversarial Networks (CycleGAN). The proposed architecture operates entirely in the Mel-spectrogram domain. A Seq2Seq encoder-decoder with attention is first employed to capture long-range temporal dependencies and generate a coarse emotion-aware spectral representation. A CycleGAN-based refinement module then enhances spectral realism and emotional expressiveness through adversarial and cycle-consistent learning, without requiring parallel emotional speech data. Finally, a neural vocoder reconstructs the time-domain waveform from the refined Mel-spectrogram. The proposed framework is evaluated on the Emotional Speech Dataset (ESD) using objective metrics, including Mel-Cepstral Distortion (MCD), fundamental frequency (F0) root-mean-square error (RMSE), and the structural similarity index (SSIM), along with subjective listening evaluations. Experimental results demonstrate that the proposed Seq2Seq-CycleGAN model outperforms conventional Seq2Seq-only and CycleGAN-only baselines in terms of emotional expressiveness, speech naturalness, and speaker similarity, confirming the effectiveness of jointly leveraging temporal modeling and adversarial spectral refinement for high-quality emotional voice conversion.
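As a rough illustration of the CycleGAN refinement objective summarized above, the following PyTorch-style sketch combines a least-squares adversarial loss with an L1 cycle-consistency loss on Mel-spectrograms. The module definitions, tensor shapes, and the cycle-loss weight are illustrative assumptions in the spirit of CycleGAN-VC-style setups, not the authors' implementation.

    # Sketch of the CycleGAN refinement losses described in the abstract:
    # adversarial + cycle-consistency learning on Mel-spectrograms.
    # All architectures and hyperparameters below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ConvGenerator(nn.Module):
        """Toy Mel-to-Mel generator (stand-in for G_A->B / G_B->A)."""
        def __init__(self, n_mels=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.GELU(),
                nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
            )
        def forward(self, mel):              # mel: (batch, n_mels, frames)
            return self.net(mel)

    class ConvDiscriminator(nn.Module):
        """Toy patch-style discriminator over Mel frames."""
        def __init__(self, n_mels=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
                nn.Conv1d(256, 1, kernel_size=5, padding=2),
            )
        def forward(self, mel):
            return self.net(mel)             # (batch, 1, frames) real/fake logits

    g_ab, g_ba = ConvGenerator(), ConvGenerator()  # source->target emotion and back
    d_b = ConvDiscriminator()                      # discriminates target-emotion Mels
    mse, l1 = nn.MSELoss(), nn.L1Loss()
    lambda_cyc = 10.0                              # assumed cycle-loss weight

    mel_a = torch.randn(4, 80, 128)                # coarse Seq2Seq output (source domain)

    fake_b = g_ab(mel_a)                           # refined emotional Mel
    # Least-squares adversarial loss: push d_b to score fake_b as real (1).
    logits_fake = d_b(fake_b)
    loss_adv = mse(logits_fake, torch.ones_like(logits_fake))
    # Cycle consistency: mapping back should recover the input, which is what
    # removes the need for parallel (frame-aligned) emotional recordings.
    loss_cyc = l1(g_ba(fake_b), mel_a)
    (loss_adv + lambda_cyc * loss_cyc).backward()

The cycle term is the key design choice for the non-parallel setting: since no time-aligned neutral/emotional utterance pairs are assumed, reconstruction through the inverse generator substitutes for a direct supervised target.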
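Likewise, the objective metrics named in the abstract can be made concrete with a short NumPy sketch of MCD and F0 RMSE, using the standard 10*sqrt(2)/ln(10) MCD scaling. The inputs are assumed to be precomputed, time-aligned features (e.g., DTW-aligned mel-cepstra and frame-wise F0 tracks); the paper's exact feature-extraction settings are not reproduced here.

    # Hedged sketch of two objective metrics from the abstract: MCD and F0 RMSE.
    import numpy as np

    def mel_cepstral_distortion(mcep_ref, mcep_conv):
        """MCD in dB between aligned mel-cepstra of shape (frames, dims).

        Assumes the 0th (energy) coefficient has already been excluded.
        """
        diff = mcep_ref - mcep_conv
        per_frame = np.sqrt(np.sum(diff ** 2, axis=1))   # Euclidean distance per frame
        return (10.0 * np.sqrt(2.0) / np.log(10.0)) * per_frame.mean()

    def f0_rmse(f0_ref, f0_conv):
        """RMSE over frames where both tracks are voiced (F0 > 0)."""
        voiced = (f0_ref > 0) & (f0_conv > 0)
        return float(np.sqrt(np.mean((f0_ref[voiced] - f0_conv[voiced]) ** 2)))

    # Example with dummy aligned features (200 frames, 24 cepstral dims):
    rng = np.random.default_rng(0)
    mcd = mel_cepstral_distortion(rng.normal(size=(200, 24)),
                                  rng.normal(size=(200, 24)))
    rmse = f0_rmse(rng.uniform(80, 300, 200), rng.uniform(80, 300, 200))
    print(f"MCD: {mcd:.2f} dB, F0 RMSE: {rmse:.2f} Hz")

Lower values are better for both metrics: MCD measures spectral-envelope distortion against the reference, while F0 RMSE over voiced frames captures how well the converted prosody tracks the target emotion.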



Published

10-02-2026

How to Cite

[1] D. Faouzi, A. Vugar, H. Mohammed R., and E. E. Elsayed, "Enhancing Emotional Expressiveness in Voice Conversion Using Seq2Seq and CycleGAN," Journal of Computing and Data Technology, vol. 1, no. 2, pp. 98–103, Feb. 2026, doi: 10.71426/jcdt.v1.i2.pp98-103.