Introduction
Neural text-to-speech (TTS) outperforms earlier speech generation approaches such as unit selection, concatenative synthesis, and hidden Markov models [1], and has progressed rapidly since the advent of WaveNet [2], a generative model that synthesizes a speech waveform from a text-derived input representation. However, WaveNet's input representation required expert feature engineering. Tacotron [3] was developed to overcome this limitation by integrating an attention model and training the network to learn a hidden representation of the text input.
Text and waveform sequences differ greatly in nature, both in length and in the phase of the signal, which makes end-to-end TTS difficult to train. Tacotron therefore proposed two-step speech generation: a generator model produces a log-mel spectrogram as an intermediate representation of the speech from the text representation, and a vocoder generates the speech waveform from that intermediate representation. Following this design, Tacotron 2 [4] and Transformer TTS [5] incorporate a neural vocoder, i.e., WaveNet, achieving a mean opinion score comparable to that of actual speech.
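The two-step pipeline described above can be sketched as a simple composition. This is an illustrative sketch only: `acoustic_model` and `vocoder` are placeholder callables standing in for, e.g., Tacotron 2 and WaveNet, and the frame rate, mel-band count, and hop length below are toy values, not the paper's settings.

```python
import numpy as np

def two_stage_tts(text, acoustic_model, vocoder):
    """Two-stage synthesis: text -> log-mel spectrogram -> waveform."""
    mel = acoustic_model(text)   # (n_frames, n_mels) log-mel spectrogram
    waveform = vocoder(mel)      # (n_samples,) audio signal
    return waveform

# Toy stand-ins that only demonstrate the data flow and shapes:
# ~5 spectrogram frames per character, 80 mel bands, hop of 256 samples.
acoustic = lambda s: np.zeros((len(s) * 5, 80))
voc = lambda m: np.zeros(m.shape[0] * 256)

audio = two_stage_tts("hello", acoustic, voc)   # 25 frames -> 6400 samples
```

The intermediate log-mel spectrogram decouples the two hard problems: the acoustic model handles text-to-frame alignment, while the vocoder handles phase reconstruction.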
Despite the high intelligibility and naturalness of the generated speech, these autoregressive models share a limitation: errors accumulate over the generation process. Moreover, their computational cost during both training and inference is large, which is unsuitable for real-world applications that must serve many user requests. FastSpeech and FastSpeech 2 are non-autoregressive TTS models that achieve fast computation while still generating highly natural speech. FastSpeech introduced a length regulator driven by a predictor of the target duration of each input representation, and FastSpeech 2 extended this into a variance adaptor that also predicts the fundamental frequency (pitch) of the target log-mel representation. Glow-TTS [6] proposed a stochastic monotonic aligner together with a normalizing-flow speech feature decoder. The stochastic monotonic aligner removes FastSpeech's dependence on an external aligner model by learning the alignment between the hidden input representation and the ground-truth log-mel spectrogram during training via log-likelihood estimation. Grad-TTS [7] took a different approach, incorporating a diffusion model, the class of generative model that currently achieves state-of-the-art results in image generation. Beyond the acoustic model, neural vocoders have also progressed rapidly toward higher naturalness and faster computation; HiFi-GAN [8] synthesizes natural-sounding speech faster than real time on a CPU.
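The duration-based expansion at the heart of FastSpeech's length regulator can be illustrated with a minimal NumPy sketch. The name `length_regulator` follows the paper's terminology, but the hand-picked durations stand in for the output of the duration predictor:

```python
import numpy as np

def length_regulator(hidden, durations):
    """Expand each token-level hidden vector to frame level by repeating
    it `durations[i]` times (the non-autoregressive alternative to
    attention-based alignment).

    hidden:    (n_tokens, d) phoneme-level hidden states
    durations: (n_tokens,)   integer frame counts per token
    returns:   (sum(durations), d) frame-level hidden states"""
    return np.repeat(hidden, durations, axis=0)

h = np.arange(6, dtype=float).reshape(3, 2)   # 3 tokens, hidden size d = 2
d = np.array([2, 1, 3])                       # "predicted" frame counts
frames = length_regulator(h, d)               # shape (6, 2)
```

Because the target length is known up front, all frames can then be decoded in parallel, which is where the speedup over autoregressive models comes from.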
For multi-speaker TTS, speaker information is incorporated into the model by conditioning speech generation on an encoded speaker representation in the form of an embedding vector [9]. Multiple methods of encoding the speaker information have been developed. A learnable lookup table (the embedding layer in PyTorch [10] and TensorFlow [11]) encodes speaker identity with hard boundaries, where the input is represented as a one-hot encoding. Other approaches train a speaker encoder with a speaker verification objective, in which the model predicts the target speaker from a given log-mel spectrogram. With this approach, the speaker encoder learns a speaker embedding space, enabling interpolation from one speaker to another. Recent speaker encoders incorporate attention mechanisms, e.g., learnable dictionary encoding [12], recurrent neural networks [13], and LSTMs. While the former approach is easier to develop, the latter enables zero-shot speaker learning in TTS.
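A minimal sketch of the lookup-table approach, using NumPy in place of the PyTorch/TensorFlow embedding layers. The additive conditioning shown is one common choice (concatenation is another), and all names here are illustrative rather than the paper's actual modules:

```python
import numpy as np

class SpeakerTable:
    """Lookup-table speaker encoder: each known speaker id maps to one
    row of an embedding matrix (equivalent to a one-hot input times a
    learnable weight matrix, hence the hard speaker boundaries)."""
    def __init__(self, n_speakers, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(n_speakers, dim))

    def __call__(self, speaker_id):
        return self.table[speaker_id]

def condition(hidden, spk_vec):
    """Condition frame-level hidden states on a speaker embedding by
    broadcasting the speaker vector across all frames and adding it."""
    return hidden + spk_vec[None, :]

enc = SpeakerTable(n_speakers=4, dim=8)
h = np.zeros((10, 8))          # 10 frames of hidden states
out = condition(h, enc(2))     # every frame shifted by speaker 2's vector
```

The verification-based encoders mentioned above replace `SpeakerTable` with a network that maps a log-mel spectrogram to an embedding, which is what allows unseen (zero-shot) speakers at inference time.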
The backbone of a TTS system remains the same across languages, consisting of a feature encoder, an aligner, a decoder module, and, for multi-speaker TTS, a speaker encoder. The differences lie in text normalization and the input embedding model. For instance, English is an intonation language and contains homographs, so it requires a grapheme-to-phoneme model and punctuation marks to improve pronunciation. Chinese and Thai are tonal languages, so they require tone marks. In Thai, the tone mark is written explicitly in the text, with the exception of “คำทับศัพท์”, Thai loanwords that originate from other languages. Thai is not a homograph-heavy language; thus, although the writing system is highly complex, a Thai syllable can be mapped one-to-one to its pronunciation, with some rare exceptions. The “ไม้ยมก” (repetition mark) also affects pronunciation: the preceding word is spoken twice, with the first utterance shorter than the second. There is a closed set of “คำควบกล้ำ ทร” words in which “ทร” is pronounced “ซ”, which can be preprocessed using pattern matching. “คำควบไม่แท้” (false clusters) are character combinations in which the pronunciation of the following character is discarded. Finally, written Thai has no explicit word boundaries within a sentence, which makes word segmentation difficult.
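Two of the Thai rules above can be sketched as simple pattern matching. This is an illustrative simplification, not the paper's actual normalizer: the real “ทร” word list is larger than the two sample entries here, and ไม้ยมก duplication interacts with word segmentation in ways this sketch ignores.

```python
import re

# Sample entries only; the closed set of "ทร" -> "ซ" words is larger.
TRO_WORDS = {"ทราบ": "ซาบ", "ทราย": "ซาย"}

def normalize_thai(text):
    # 1) Rewrite false-cluster "ทร" words to their "ซ" pronunciation
    #    via dictionary lookup (pattern matching over a closed set).
    for word, pron in TRO_WORDS.items():
        text = text.replace(word, pron)
    # 2) Expand the repetition mark ไม้ยมก (ๆ) by duplicating the
    #    preceding word. (In speech the first copy is shorter; text
    #    normalization can only duplicate, not shorten.)
    text = re.sub(r"(\S+)ๆ", r"\1 \1", text)
    return text

expanded = normalize_thai("เด็กๆ")   # -> "เด็ก เด็ก"
```

Rules like these run before grapheme-to-phoneme conversion, so the downstream model only ever sees the expanded, pronunciation-faithful form.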
Beyond language-specific text processing, many parts of neural TTS remain highly challenging, e.g., zero-shot multi-speaker TTS, multilingual TTS, the text-to-speech sequence aligner, the generative model itself, the neural vocoder, controllable speaker identity and prosody, low-resource TTS, training from noisy data, etc.
TTS models are now widely adopted in many applications: screen readers for voice interfaces, web content readers, chat readers for streamers, audiobooks, and videos for entertainment and marketing. TTS can also play a role in the metaverse by providing customizable human speech. With careful user-journey design, highly natural and intelligible TTS can therefore provide a seamless experience, increasing the value of a product or service.
In this paper, we propose a Thai multi-speaker bilingual text-to-speech model, named VulcanTTS. Our contributions are 1) Thai text normalization, 2) a multi-speaker encoder, and 3) a generator model supporting both multiple speakers and two languages (Thai and English). We use a pretrained HiFi-GAN [14] as the vocoder.

References

[1] Yamagishi, Junichi. "An introduction to HMM-based speech synthesis." Technical report (2006).
[2] Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
[3] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
[4] Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[5] Li, Naihan, et al. "Neural speech synthesis with Transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.
[6] He, Mutian, Yan Deng, and Lei He. "Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS." arXiv preprint arXiv:1906.00672 (2019).
[7] Popov, Vadim, et al. "Grad-TTS: A diffusion probabilistic model for text-to-speech." International Conference on Machine Learning. PMLR, 2021.
[8] Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.
[9] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in Neural Information Processing Systems 31 (2018).
[10] PyTorch. https://pytorch.org/
[11] TensorFlow. https://www.tensorflow.org/
[12] Cai, Weicheng, et al. "A novel learnable dictionary encoding layer for end-to-end language identification." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[13] Gibiansky, Andrew, et al. "Deep Voice 2: Multi-speaker neural text-to-speech." Advances in Neural Information Processing Systems 30 (2017).