Method

Text Processing

The input text is preprocessed and normalized before being fed to the TTS model. We divide text processing into two main parts: the written-to-spoken text conversion module and the text preprocessing module. The former converts written text into spoken text, e.g., a date written as 1 Mar 2022 becomes “first of march two thousand twenty two”, and $100 becomes “one hundred dollars”. This module differs for each application, since users may prefer different pronunciations. More detail on our current text normalization can be found here. The latter refers to the process of adding more information to the text. After conversion to spoken text, Thai and English characters are preprocessed further.
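As an illustration of the written-to-spoken module, here is a minimal sketch; the regexes, the month table, and the use of the num2words package are illustrative assumptions, not the project's actual normalizer:

```python
import re
from num2words import num2words  # pip install num2words

# Illustrative subset; a real module covers all months, decimals, times, etc.
MONTHS = {"Jan": "january", "Feb": "february", "Mar": "march"}

def expand_date(m: re.Match) -> str:
    # "1 Mar 2022" -> "first of march two thousand and twenty-two"
    # (num2words' year wording differs slightly from the target above).
    day, mon, year = m.groups()
    return (f"{num2words(int(day), to='ordinal')} of {MONTHS[mon]} "
            f"{num2words(int(year))}")

def expand_currency(m: re.Match) -> str:
    # "$100" -> "one hundred dollars"
    return f"{num2words(int(m.group(1)))} dollars"

def written_to_spoken(text: str) -> str:
    text = re.sub(r"\b(\d{1,2}) (Jan|Feb|Mar) (\d{4})\b", expand_date, text)
    text = re.sub(r"\$(\d+)\b", expand_currency, text)
    return text

print(written_to_spoken("Pay $100 by 1 Mar 2022."))
```

The Thai preprocessing steps are as follows: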

  1. Normalize Thai numerals to Arabic numerals, e.g., ๑ to 1 and ๒ to 2.

  2. ไม้ยมก (the repetition mark ๆ): e.g., ดีๆ and มากๆ are preprocessed to ดีๆดี and มากๆมาก. We intentionally keep the ๆ mark as an indicator that the first syllable should be read shorter than the second.

  3. คำควบกล้ำ ทร (the ทร consonant cluster): e.g., ทรัพย์ is replaced with “ซัพย์”. Since the set of these words is closed, we can define a lookup dictionary and perform pattern matching (see the sketch after this list).
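A minimal sketch of these three rules; the lookup dictionary has only one illustrative entry, and the regex approximates word boundaries with whitespace, whereas proper ไม้ยมก handling needs Thai word segmentation:

```python
import re

# Rule 1: Thai digits -> Arabic digits.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Rule 3: closed set of ทร words; one illustrative entry only.
TR_LOOKUP = {"ทรัพย์": "ซัพย์"}

def preprocess_thai(text: str) -> str:
    text = text.translate(THAI_DIGITS)
    # Rule 2: duplicate the word before ไม้ยมก, keeping ๆ as the marker
    # that the first copy is read shorter, e.g. ดีๆ -> ดีๆดี.
    text = re.sub(r"(\S+?)ๆ", r"\1ๆ\1", text)
    for written, spoken in TR_LOOKUP.items():
        text = text.replace(written, spoken)
    return text

print(preprocess_thai("๑ ดีๆ ทรัพย์"))  # -> "1 ดีๆดี ซัพย์"
```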

For English characters, the text is converted into IPA phonetic representations, which form part of the symbol set described below.

The common preprocessing steps, applied to both languages, are (sketched in code after the list):

  1. Normalize runs of whitespace into a single space.

  2. Add a starting <sos> token at the beginning of the sentence and an ending <eos> token at the end.

  3. Insert the <sep> token as a pause mark between sentences. For English sentences, <sep> is placed after “.”.

  4. Encode the text into a one-hot representation. The symbol set includes IPA phonetic representations, Thai characters, punctuation, and numbers.

Thai word boundaries and other Thai pronunciation rules are learned implicitly by the model.
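A minimal sketch of steps 1 to 4; the symbol inventory below is a hypothetical stand-in for the real set of IPA phones, Thai characters, punctuation, and digits:

```python
import numpy as np

# Hypothetical symbol inventory, for illustration only.
SYMBOLS = ["<sos>", "<eos>", "<sep>", " ", ".", "a", "b", "k", "ก", "ข"]
SYM2ID = {s: i for i, s in enumerate(SYMBOLS)}

def encode(text: str) -> np.ndarray:
    # 1. Normalize runs of whitespace into a single space.
    text = " ".join(text.split())
    # 2./3. Wrap with <sos>/<eos>; insert <sep> after each "." as a pause.
    tokens = ["<sos>"]
    for ch in text:
        tokens.append(ch)
        if ch == ".":
            tokens.append("<sep>")
    tokens.append("<eos>")
    # 4. One-hot encode: one row per token.
    ids = [SYM2ID[t] for t in tokens]
    return np.eye(len(SYMBOLS), dtype=np.float32)[ids]

print(encode("a b. ba").shape)  # -> (10, 10): 10 tokens, 10 symbols
```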

Speaker Encoder

The speaker encoder model encodes the speech of each speaker into a latent representation. The key component of the encoding model is the loss function, where the model must maximize the distance between speech pairs from different speakers and minimize the distance between speech pairs from the same speaker. The two main types of loss functions, corresponding to the training objective, are: 1) the classification objective (softmax variants), and 2) the metric learning objective. The current trend is to train the speaker encoder on both objectives jointly, as sketched below.
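A minimal sketch of combining the two objectives; the metric term below is a simple cosine contrastive loss with a hypothetical margin, standing in for whichever softmax and metric learning variants are actually used:

```python
import torch
import torch.nn.functional as F

def speaker_loss(embeddings, logits, labels, margin=0.2):
    """Combined objective; assumes each speaker occurs >= 2 times per batch."""
    # 1) Classification objective (softmax variant).
    ce = F.cross_entropy(logits, labels)

    # 2) Metric learning objective: pull same-speaker pairs together and
    #    push different-speaker pairs below `margin` cosine similarity.
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                            # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = sim[same & ~eye]                         # same-speaker pairs
    neg = sim[~same]                               # different-speaker pairs
    metric = (1 - pos).mean() + F.relu(neg - margin).mean()

    return ce + metric

emb = torch.randn(8, 192)                # dummy embeddings
logits = torch.randn(8, 100)             # dummy classifier outputs
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(speaker_loss(emb, logits, labels))
```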


The speaker encoder is composed of a convolutional block, a transformer encoder, and attention pooling; the designs of the speaker encoder and the convolutional block are shown in the figure below.
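In code, the described stack might look like the following sketch; all layer sizes, kernel widths, and depths are illustrative assumptions rather than the actual hyperparameters:

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Conv block -> transformer encoder -> attention pooling (sketch)."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.conv = nn.Sequential(                 # convolutional block
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.attn = nn.Linear(d_model, 1)          # attention pooling scores

    def forward(self, mels):                       # mels: (batch, n_mels, frames)
        x = self.conv(mels).transpose(1, 2)        # (batch, frames, d_model)
        x = self.transformer(x)
        w = torch.softmax(self.attn(x), dim=1)     # weights over frames
        return (w * x).sum(dim=1)                  # (batch, d_model) embedding

print(SpeakerEncoder()(torch.randn(2, 80, 120)).shape)  # torch.Size([2, 256])
```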

Generator

The generator model converts the one-hot encoded text symbols into a log-mel spectrogram. The architecture design of the generator model is shown in the figure below.

The total training loss is computed as follows:

\text{total_loss}=\text{prior_loss}+\text{duration_loss}+\text{mel_loss}+\text{auxiliary_mel_loss}

The generator is also trained adversarially against a discriminator; the generative and discriminative losses are computed as follows:

\text{generative_loss} = \text{adversarial_loss}(G; D) + \text{feature_matching_loss}(G; D) + \text{Mel_loss}(G)
\text{discriminative_loss} = \text{adversarial_loss}(D; G)

The adversarial loss is computed as follows:
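Assuming the least-squares GAN objective used by HiFi-GAN (consistent with the feature matching and mel losses described below), where x is the ground-truth audio and s is the generator's input condition:

\text{adversarial_loss}(D; G) = \mathbb{E}_{(x, s)}\left[\left(D(x) - 1\right)^{2} + \left(D(G(s))\right)^{2}\right]
\text{adversarial_loss}(G; D) = \mathbb{E}_{s}\left[\left(D(G(s)) - 1\right)^{2}\right]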

\text{Mel_loss} is a reconstruction loss in which the mel-spectrogram is recomputed from the speech synthesized by the generative model. The computation is as follows:
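Assuming the HiFi-GAN formulation, with \varphi denoting the function that maps a waveform to its mel-spectrogram, this is the L1 distance:

\text{Mel_loss}(G) = \mathbb{E}_{(x, s)}\left[\left\lVert \varphi(x) - \varphi(G(s)) \right\rVert_1\right]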

The feature matching loss is a learned similarity metric, measured as the difference between the discriminator features of a ground-truth sample and a generated sample. Every intermediate feature of the discriminator is extracted, and the L1 distance between a ground-truth sample and a conditionally generated sample is calculated in each feature space. The computation is as follows:
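Assuming the HiFi-GAN formulation, where T denotes the number of layers in the discriminator, and D^i and N_i denote the features and the number of features in the i-th layer:

\text{feature_matching_loss}(G; D) = \mathbb{E}_{(x, s)}\left[\sum_{i=1}^{T} \frac{1}{N_i} \left\lVert D^{i}(x) - D^{i}(G(s)) \right\rVert_1\right]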

Dataset

Experiment Setup

Model Hyperparameters

References
