Neural network text-to-speech (TTS) has outperformed earlier speech generation approaches, e.g., unit selection, concatenation, and hidden Markov models1, and has progressed rapidly since the advent of WaveNet2, a generative model that synthesizes a speech waveform from a text-derived input representation. However, WaveNet's input representation required hand-crafted features engineered by an expert. Tacotron3 was developed to overcome this limitation by integrating an attention mechanism and training the model to learn a hidden representation of the text input.
Text and waveform sequences differ greatly in nature, in particular in their lengths and in the phase of the signal, which makes end-to-end TTS difficult to train. Tacotron therefore proposed two-step speech generation: a generator model produces a log-mel spectrogram as an intermediate representation of the speech from the text representation, and a vocoder generates the speech waveform from that intermediate representation. Following this design, Tacotron24 and Transformer TTS5 incorporated a neural vocoder, i.e., WaveNet, achieving mean opinion scores comparable to those of actual speech.
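To make this intermediate representation concrete, the following is a minimal sketch, assuming torchaudio, of computing a log-mel spectrogram from a waveform; the file name and the parameter values (FFT size, hop length, number of mel bins) are illustrative assumptions, not the settings used in this paper.

```python
import torch
import torchaudio

# Load a waveform; "speech.wav" is a hypothetical input file.
waveform, sample_rate = torchaudio.load("speech.wav")

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,       # analysis window size (assumed)
    hop_length=256,   # frame shift (assumed)
    n_mels=80,        # 80 mel bins is a common choice for TTS
)
mel = mel_transform(waveform)                    # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))  # log compression

# At inference, the generator predicts log_mel from text and the
# vocoder (e.g., WaveNet or HiFi-GAN) inverts it back to a waveform.
```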
Despite the high intelligibility and naturalness of the generated speech, these autoregressive models are limited by their sequential generation, in which errors accumulate over the generative process. Moreover, their computational cost during both training and inference is large, making them unsuitable for real-world applications that serve a large number of user requests. FastSpeech and FastSpeech 2 are non-autoregressive TTS models that achieve fast computation while generating highly natural speech. FastSpeech introduced a duration predictor that estimates how many output frames each input representation spans, and FastSpeech 2 extended this into a variance adaptor that additionally predicts the fundamental frequency (pitch) of the target log-mel spectrogram. Glow-TTS6 proposed monotonic alignment search together with a normalizing-flow speech feature decoder. This aligner removed FastSpeech's requirement for an external aligner model by learning the alignment between the hidden input representation and the ground-truth log-mel spectrogram during training through log-likelihood estimation. Grad-TTS7 took a different approach, incorporating a diffusion model, a class of generative models that currently achieves state-of-the-art results in image generation. Beyond acoustic models, neural vocoders have also progressed rapidly, aiming for higher naturalness and faster computation; HiFi-GAN8 synthesizes natural speech faster than real time on a CPU.
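As an illustration of the duration-based expansion that non-autoregressive models rely on, the following is a minimal sketch, assuming PyTorch, of a length regulator in the spirit of FastSpeech: each hidden text representation is repeated according to its predicted duration so that the output matches the mel-spectrogram frame rate. The shapes and names are our own illustrative choices, not the original implementation.

```python
import torch

def length_regulator(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand token-level hidden states to frame level.

    hidden:    (num_tokens, hidden_dim) encoder outputs for one utterance
    durations: (num_tokens,) predicted integer frame counts per token
    returns:   (durations.sum(), hidden_dim) frame-aligned hidden states
    """
    return torch.repeat_interleave(hidden, durations, dim=0)

# Example: 3 input tokens expanded to 2 + 4 + 1 = 7 mel frames.
hidden = torch.randn(3, 256)
durations = torch.tensor([2, 4, 1])
print(length_regulator(hidden, durations).shape)  # torch.Size([7, 256])
```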
For multi-speaker TTS, speaker information is incorporated into the model by conditioning the speech generation on encoded speaker information in the form of an embedding vector9. Multiple methods to encode speaker information have been developed. A learnable lookup table (the embedding layer in PyTorch10 and TensorFlow11) encodes speaker information with hard boundaries, where the input is represented as a one-hot encoding. Other approaches train a speaker encoding model with a speaker verification objective, where the model predicts the target speaker from a given log-mel spectrogram. With this approach, the speaker encoder learns a speaker embedding space, enabling interpolation from one speaker to another. Recent progress in speaker encoding incorporates attention mechanisms, e.g., learnable dictionary encoding12, recurrent neural networks13, and LSTMs. While the former approach is easier to develop, the latter enables zero-shot speaker learning in TTS.
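As an illustration of the lookup-table approach, the following is a minimal sketch, assuming PyTorch, of conditioning a hidden text representation on a learned speaker embedding by broadcasting and concatenation; all dimensions and variable names are illustrative assumptions rather than the configuration of VulcanTTS.

```python
import torch
import torch.nn as nn

num_speakers, speaker_dim, hidden_dim = 10, 64, 256
speaker_table = nn.Embedding(num_speakers, speaker_dim)  # learnable lookup table

# One utterance of 7 frames spoken by speaker id 3.
hidden = torch.randn(1, 7, hidden_dim)          # (batch, frames, hidden_dim)
speaker_id = torch.tensor([3])

spk = speaker_table(speaker_id)                 # (batch, speaker_dim)
spk = spk.unsqueeze(1).expand(-1, hidden.size(1), -1)   # broadcast over frames
conditioned = torch.cat([hidden, spk], dim=-1)  # (batch, frames, hidden_dim + speaker_dim)
```

A verification-based speaker encoder would replace `speaker_table` with a network that maps a reference log-mel spectrogram into the same embedding space, which is what enables zero-shot speaker learning.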
In this paper, we propose a Thai multi-speaker bilingual text-to-speech model, namely VulcanTTS. Our contributions are 1) Thai text normalization, 2) a multi-speaker encoder, and 3) a generator model that supports both multiple speakers and bilingual input (Thai and English). We use a pretrained HiFi-GAN8 as the vocoder.
1Yamagishi, Junichi. "An introduction to HMM-based speech synthesis." Technical Report (2006).
2Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
3Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
4Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
5Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.
6Kim, Jaehyeon, et al. "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search." Advances in Neural Information Processing Systems 33 (2020).
7Popov, Vadim, et al. "Grad-TTS: A diffusion probabilistic model for text-to-speech." International Conference on Machine Learning. PMLR, 2021.
8Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.
9Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in Neural Information Processing Systems 31 (2018).
10Paszke, Adam, et al. "PyTorch: An imperative style, high-performance deep learning library." Advances in Neural Information Processing Systems 32 (2019).
11Abadi, Martín, et al. "TensorFlow: A system for large-scale machine learning." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016.
12Cai, Weicheng, et al. "A novel learnable dictionary encoding layer for end-to-end language identification." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
13Gibiansky, Andrew, et al. "Deep Voice 2: Multi-speaker neural text-to-speech." Advances in Neural Information Processing Systems 30 (2017).