Introduction

The backbone of a TTS system remains the same across languages: it consists of a feature encoder, an aligner, a decoder module, and, for multi-speaker TTS, a speaker encoder. What differs between languages is the text normalization and the input embedding model. For instance, English is an intonation language and contains homographs, so it requires a grapheme-to-phoneme model and punctuation marks to improve pronunciation. Chinese and Thai are tonal languages, so they require tone marks.

In Thai, the tone mark is written explicitly in the text, with the exception of “คำทับศัพท์” (loanwords, i.e., Thai words that originate from other languages). Thai is not a homograph language, so although its writing system is highly complex, a Thai syllable can be mapped one-to-one to its pronunciation, with some rare exceptions. The “ไม้ยมก” (the repetition mark ๆ) also affects pronunciation: the first utterance of the repeated word is spoken shorter than the second. There is a closed set of “คำควบกล้ำ ทร” (words containing the ทร cluster) in which the cluster is pronounced as “ซ”; these can be preprocessed using pattern matching. “คำควบไม่แท้” (false clusters) are character combinations in which the following character is not pronounced. Finally, written Thai has no clear word boundaries within a sentence, which makes word segmentation difficult.
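Two of the Thai-specific rules described above, expanding “ไม้ยมก” and respelling the closed set of ทร words, can be sketched as a small preprocessing step. This is only a minimal illustration under stated assumptions: the input is assumed to be already segmented into words, the function names and the lookup table are hypothetical, the table lists only a few members of the closed set, and the respellings are ad hoc phonetic spellings rather than a standard transcription.

```python
MAI_YAMOK = "ๆ"  # U+0E46, the Thai repetition mark

# Hypothetical, partial lookup table: a few ทร words pronounced with "ซ".
# The respellings on the right are illustrative, not a standard transcription.
THOR_RO_RESPELL = {
    "ทราบ": "ซาบ",
    "ทราย": "ซาย",
    "ทรง": "ซง",
    "ทรุด": "ซุด",
    "แทรก": "แซก",
}


def expand_mai_yamok(tokens):
    """Replace each repetition mark with a copy of the preceding word."""
    out = []
    for tok in tokens:
        base = tok.rstrip(MAI_YAMOK)
        if base == tok:
            # No repetition mark in this token.
            out.append(tok)
        elif base:
            # Mark attached to the word, e.g. "เร็วๆ".
            out.extend([base, base])
        elif out:
            # Stand-alone mark: repeat the previous output word.
            out.append(out[-1])
    return out


def respell_thor_ro(tokens):
    """Respell whole words found in the closed-set table; leave others as-is."""
    return [THOR_RO_RESPELL.get(tok, tok) for tok in tokens]


def normalize_tokens(tokens):
    """Apply both rules to a list of pre-segmented words."""
    return respell_thor_ro(expand_mai_yamok(tokens))
```

Matching whole words rather than substrings is a deliberate choice in this sketch, since it avoids rewriting ทร inside words where the cluster is pronounced normally. Plain duplication also discards the duration difference between the two utterances of a repeated word, so a real front end would likely need to keep a marker for it.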

Beyond language-specific text processing, many parts of neural text-to-speech remain highly challenging problems, e.g., zero-shot multi-speaker TTS, multilingual TTS, text-to-speech sequence alignment, generative modeling, neural vocoders, controllable speaker identity and speech prosody, low-resource TTS, and training TTS models from noisy data.

TTS is now widely adopted in many applications: screen readers for voice interfaces, web content readers, chat readers for streamers, audiobooks, and videos for entertainment and marketing. TTS can also play a role in the Metaverse by providing customizable human speech. With careful user journey design, highly natural and intelligible TTS can therefore provide a seamless experience to the user, increasing the value of a product or service.

