## Method
The main component of the TTS model is a Transformer$^{15}$. We trained a speaker encoder to model the speaker embedding space, then used the resulting embedding vectors to condition the generator. The multi-speaker design generalizes better and produces better pronunciation. To handle multiple languages, we combined the English dataset with the Thai dataset. The model can therefore produce both Thai and English speech, even for a speaker who never spoke English in the training set. The overall VulcanTTS architecture is as follows:

## Text Processing

The input text was preprocessed and normalized before being fed to the TTS model. We divide text preprocessing into two main parts: a written-to-spoken text conversion module and a text preprocessing module. The former converts written text into spoken text, e.g., the date 1 Mar 2022 into “first of march two thousand twenty two”, or \$100 into “one hundred dollars”. This module differs between applications, since users may prefer different pronunciations. More detail of our current text normalization can be found here. The latter adds more information to the text. After conversion to spoken text, the Thai and English characters are preprocessed. The Thai preprocessing steps are as follows:
1. Normalize Thai numerals to Arabic numerals, e.g., ๑ to 1 and ๒ to 2.
2. Expand ไม้ยมก (the repetition mark ๆ), e.g., ดีๆ and มากๆ are preprocessed to ดีๆดี and มากๆมาก. We intentionally keep the ๆ mark as an indicator that the first syllable should be read shorter than the second syllable.
3. Replace the คำควบกล้ำ (consonant cluster) ทร, e.g., ทรัพย์ is replaced with “ซัพย์”. Since the set of these words is closed, we can define a lookup dictionary and perform pattern matching.
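The three Thai steps above can be sketched in a few lines of Python. This is a minimal illustration, not the VulcanTTS implementation: the function names are hypothetical, the lookup dictionary holds only the single example from the text, and the ไม้ยมก expansion assumes the input has already been split into words.

```python
# Thai digits ๐-๙ mapped to Arabic digits 0-9.
THAI_TO_ARABIC = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

MAI_YAMOK = "ๆ"

# Hypothetical lookup table; only the example from the text is listed.
CLUSTER_LOOKUP = {"ทรัพย์": "ซัพย์"}


def normalize_thai_digits(text: str) -> str:
    """Step 1: replace Thai numerals with Arabic numerals."""
    return text.translate(THAI_TO_ARABIC)


def expand_mai_yamok(words: list[str]) -> list[str]:
    """Step 2: repeat the word after the ๆ mark, keeping the mark itself.

    Assumes `words` is already segmented (an assumption of this sketch).
    """
    expanded = []
    for word in words:
        if len(word) > 1 and word.endswith(MAI_YAMOK):
            expanded.append(word + word[:-1])
        else:
            expanded.append(word)
    return expanded


def replace_clusters(text: str) -> str:
    """Step 3: dictionary-based replacement of the closed set of ทร words."""
    for written, spoken in CLUSTER_LOOKUP.items():
        text = text.replace(written, spoken)
    return text
```

In practice the dictionary would list the whole closed word set, and the word segmentation would come from an upstream tokenizer.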
For English characters:
1. A grapheme-to-phoneme (G2P) model, DeepPhonemizer$^{16}$, was used to convert English characters into a phoneme representation. This conversion is needed for English because of homographs.
The common preprocessing steps are:
1. Normalize multiple whitespace characters into one.
2. Add the starting <sos> and ending <eos> tokens at the beginning and end of the sentence.
3. Use the <sep> token as the pause mark between sentences. For English sentences, <sep> is placed after “.”.
4. Encode the text into a one-hot representation. The text symbols include IPA phonetic representations, Thai characters, punctuation, and numbers.
The Thai word boundary and other Thai pronunciation rules were learned by the model.
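As a rough illustration of the common preprocessing steps, the sketch below normalizes whitespace, adds the special tokens, and one-hot encodes the result. The function names and the choice to skip <sep> after the final period are assumptions of this sketch, and plain characters stand in for the real symbol set (IPA symbols, Thai characters, punctuation, numbers).

```python
import re

SOS, EOS, SEP = "<sos>", "<eos>", "<sep>"


def tokenize(text: str) -> list[str]:
    """Collapse whitespace, then wrap the symbols in <sos> ... <eos>."""
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [SOS]
    for index, char in enumerate(text):
        tokens.append(char)
        # Place <sep> after a period unless it ends the whole input
        # (an assumption: the pause mark sits between sentences).
        if char == "." and index < len(text) - 1:
            tokens.append(SEP)
    tokens.append(EOS)
    return tokens


def one_hot(tokens: list[str], vocab: list[str]) -> list[list[int]]:
    """Encode each token as a one-hot row over the vocabulary."""
    index = {symbol: i for i, symbol in enumerate(vocab)}
    rows = []
    for token in tokens:
        row = [0] * len(vocab)
        row[index[token]] = 1
        rows.append(row)
    return rows
```

For example, `tokenize("a.  b.")` yields `<sos>`, `a`, `.`, `<sep>`, a single space, `b`, `.`, `<eos>`.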

## Speaker Encoder

The speaker encoder encodes the speech of each speaker into a latent representation. The key component is the loss function: the model must maximize the distance between speech pairs from different speakers and minimize the distance between speech pairs from the same speaker. Two main types of loss function correspond to this training objective: 1) the classification objective (softmax variants) and 2) the metric learning objective. The current trend is to train the speaker encoder on both objectives.
Since our dataset contains noisy speech, our speaker encoder uses MagFace$^{17}$, which improves on ArcFace$^{18}$ by taking the magnitude of the feature vector into account and explicitly distributing features along the angular direction, which prevents the model from overfitting to noisy and low-quality data. This loss aims to improve the intra-class feature distribution by keeping high-quality samples close to the class center while low-quality samples are distributed around the class boundary. In other words, it tackles the data variability problem, where low-quality data degrade the encoding performance. The magnitude of the feature vector is used as the measure of quality. The figure below compares ArcFace and MagFace.
Let $f$ be the encoding feature without normalization, so that its magnitude is $a=\vert\vert f \vert\vert$. MagFace introduces the magnitude-aware angular margin $m(a)$ and the regularizer $g(a)$, which is a monotonically decreasing convex function with respect to $a$. Modifying ArcFace, MagFace is defined as follows:
$L_{Mag}=\frac{1}{N} \sum_{i=1}^N L_i$
where
$L_i=-\log\frac{e^{s\cos(\theta_{y_i}+m(a_i))}}{e^{s\cos(\theta_{y_i}+m(a_i))}+\sum_{j\neq y_i}e^{s\cos \theta_j}} + \lambda_g g(a_i)$
The hyper-parameter $\lambda_g$ trades off the classification and regularization losses. In this study, $m(a)$ is a linear function, the lower and upper bounds of $a$ are 10 and 110, and $g(a)$ is a hyperbola.
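To make the per-sample term $L_i$ concrete, the following sketch evaluates it for one sample with a linear $m(a)$ and the hyperbolic regularizer $g(a)=1/a+a/u_a^2$ from the MagFace paper. The bounds 10 and 110 come from the text; the margin bounds `l_m` and `u_m`, the scale `s`, and the weight `lambda_g` are illustrative assumptions, not values reported for VulcanTTS.

```python
import math


def margin(a, l_a=10.0, u_a=110.0, l_m=0.45, u_m=0.8):
    """Linear magnitude-aware margin m(a); l_m and u_m are assumed values."""
    a = min(max(a, l_a), u_a)  # keep the magnitude inside [l_a, u_a]
    return l_m + (u_m - l_m) * (a - l_a) / (u_a - l_a)


def regularizer(a, u_a=110.0):
    """Hyperbolic regularizer g(a), decreasing and convex on (0, u_a]."""
    return 1.0 / a + a / (u_a ** 2)


def magface_loss(cosines, label, a, s=64.0, lambda_g=35.0):
    """Per-sample L_i from cos(theta_j) for every class j and magnitude a."""
    theta_y = math.acos(max(-1.0, min(1.0, cosines[label])))
    logits = [s * c for c in cosines]
    logits[label] = s * math.cos(theta_y + margin(a))
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -(logits[label] - log_sum) + lambda_g * regularizer(a)
```

Because $m(a)$ grows with the magnitude, a high-magnitude (high-quality) sample must sit closer to its class center to reach the same loss, which matches the behaviour described above.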
For the metric learning objective, we use the Angular Prototypical loss. It builds on the prototypical loss$^{19}$, in which the size of the support set is fixed in each mini-batch, and on the generalized end-to-end loss$^{20}$, in which a cosine-based similarity metric with learnable scale and bias is used instead of the Euclidean distance metric:
$S_{j,k}=w\cdot \cos(x_{j,M},c_k)+b$
Here $S$ is the similarity function that takes the place of the distance function, with learnable scale $w$ and bias $b$.
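A small sketch of how $S_{j,k}$ can be turned into a loss: each query is scored against every speaker centroid and a cross-entropy over those scores rewards the matching speaker. This follows the common formulation of the angular prototypical loss rather than code from the source; the function names are hypothetical, and `w=10.0`, `b=-5.0` are only assumed starting values for the learnable scale and bias.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Mean of the support embeddings of one speaker."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def angular_prototypical_loss(support, queries, w=10.0, b=-5.0):
    """support[k]: embeddings of speaker k; queries[j]: query of speaker j."""
    centroids = [centroid(vectors) for vectors in support]
    total = 0.0
    for j, query in enumerate(queries):
        scores = [w * cosine(query, c) + b for c in centroids]  # S_{j,k}
        peak = max(scores)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in scores))
        total += -(scores[j] - log_sum)  # cross-entropy, target k = j
    return total / len(queries)
```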
The speaker encoder is composed of a convolutional block, a Transformer encoder, and an attention layer. The speaker encoder and the convolutional block designs are shown in the following figure:
After attention, the features are pooled along the temporal dimension into 1) a weighted sum, called $mu$, and 2) a weighted standard deviation, computed as follows:
$mu=\sum(h\cdot w)$
$std=\sqrt{\sum(h^2\cdot w)-mu^2}$
where $h$ is a feature vector and $w$ is a weight from the attention model. The sum $\sum$ is taken along the temporal dimension; for the standard deviation, $mu^2$ is subtracted from the weighted sum of squares before taking the square root. Both $mu$ and $std$ are concatenated and projected to the desired embedding dimension. The standard deviation feature is included to improve the within-group representation, where both clean and noisy speech are present.
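The pooling above can be sketched in plain Python as follows. The sketch assumes one scalar attention weight per frame, with the weights summing to 1 over time, and it clamps the variance at a small epsilon; both are assumptions of this illustration, and the final projection layer is omitted.

```python
import math


def attentive_stats_pooling(h, w):
    """h: T frames, each a list of D features; w: T attention weights."""
    dim = len(h[0])
    # Weighted mean over time: mu = sum(h * w).
    mu = [sum(w_t * frame[d] for frame, w_t in zip(h, w)) for d in range(dim)]
    # Weighted second moment over time: sum(h^2 * w).
    second = [
        sum(w_t * frame[d] ** 2 for frame, w_t in zip(h, w)) for d in range(dim)
    ]
    # Clamp so rounding errors never give a negative value under the root.
    std = [math.sqrt(max(m2 - m * m, 1e-9)) for m2, m in zip(second, mu)]
    return mu + std  # concatenation of mu and std, length 2 * D
```

The returned vector has twice the feature dimension, which is why a projection to the embedding dimension follows in the model.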
During inference, the speaker embedding vector was length-normalized along the feature dimension, which has been shown$^{21}$ to improve end-to-end performance.

## Generator

The generator model converts the one-hot encoded text symbols into a log-mel spectrogram. The architecture of the generator model is as follows:
The main design follows the FastSpeech2 model, using the Transformer for both the encoder and decoder modules. We add another Transformer module as a text feature encoder. The text is fed to the model as a one-hot representation, and a learnable lookup table layer converts it into a text embedding vector. The Prenet encodes the embedding vectors so that each feature is encoded both with and without its neighboring features. The encoded text feature is then concatenated with the speaker embedding and fed to the Transformer encoder. Instead of an external aligner, our model uses the monotonic alignment search proposed in Glow-TTS to find the alignment between the encoded feature and the ground-truth log-mel spectrogram during TTS training. The result from the aligner is used as the target for the duration predictor. Thus, during inference, the model runs non-autoregressively: the sequence length of the target log-mel spectrogram is predicted in one step. The encoded feature is then upsampled using the lengths from the alignment. The decoder is composed of a Transformer encoder, where we added an auxiliary loss between the Transformer blocks, similar to Parallel Tacotron$^{22}$, which showed an improvement in the mean opinion score for speech naturalness. The final layer is a projection layer that projects the decoder feature into a log-mel spectrogram.
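The upsampling step mentioned above, where each encoded text feature is repeated for as many frames as its duration, can be illustrated with a short sketch. The function name `length_regulate` is hypothetical, and the durations are assumed to be non-negative integers.

```python
def length_regulate(features, durations):
    """Repeat features[i] durations[i] times along the time axis."""
    if len(features) != len(durations):
        raise ValueError("features and durations must have the same length")
    upsampled = []
    for feature, duration in zip(features, durations):
        upsampled.extend([feature] * duration)
    return upsampled
```

With durations `[1, 2, 1]`, three text features become four frames, so the output length equals the sum of the durations, i.e. the predicted spectrogram length.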
Three training objectives were used to train the generator model: the duration loss, the prior loss, and the mel loss. The duration loss is the Huber loss$^{23}$ between the predicted duration and the target duration from the monotonic alignment search. Since noisy speech segments are present in the dataset, an L2 loss would weight outliers too heavily, while the Huber loss is less sensitive to outliers. The prior loss is computed from the result of the aligner to ensure that the aligner does not repeat or skip the text representation:
$\log{P_Z(z\vert c; \theta, A)}=\sum_{j=1}^{T_{mel}}\log \mathcal{N}(z_j;\mu_{A(j)},\sigma_{A(j)})$
Here, $T_{mel}$ is the length of the log-mel spectrogram. The text encoder maps the text condition $c=c_{1:T_{text}}$ into the statistics $\mu=\mu_{1:T_{text}}$ and $\sigma=\sigma_{1:T_{text}}$, where $T_{text}$ denotes the length of the text input. In this formulation, the alignment function $A$ maps the index of the latent representation of speech to the index of the statistics from $f_{enc}$: $A(j)=i$ if $z_j\sim\mathcal{N}(z_j;\mu_i,\sigma_i)$. Last, the mel loss is the mean squared error between the predicted and target log-mel spectrograms. The total loss function combines the duration, prior, and mel losses.
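As a hedged illustration of two of these terms, the sketch below implements the Huber loss for a single duration error and the Gaussian prior log-likelihood summed along an alignment. The threshold `delta=1.0` and the diagonal-covariance form are assumptions of this sketch, not settings reported above.

```python
import math


def huber(error, delta=1.0):
    """Quadratic near zero, linear beyond `delta` (assumed threshold)."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)


def half_squared(error):
    """L2-style penalty, shown only for comparison with the Huber loss."""
    return 0.5 * error ** 2


def prior_log_likelihood(z, mu, sigma, alignment):
    """Sum over frames j of log N(z_j; mu_A(j), sigma_A(j)), diagonal case."""
    total = 0.0
    for j, i in enumerate(alignment):
        for z_d, mu_d, sigma_d in zip(z[j], mu[i], sigma[i]):
            total += (
                -0.5 * math.log(2.0 * math.pi)
                - math.log(sigma_d)
                - 0.5 * ((z_d - mu_d) / sigma_d) ** 2
            )
    return total
```

For an outlier with error 10, the Huber penalty is 9.5 while the squared penalty is 50, which shows why noisy duration targets dominate an L2 loss much more strongly.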
The monotonic alignment search proposed in Glow-TTS searches for the alignment between the latent representation (feature encoder) and the prior distribution (target log-mel spectrogram) that maximizes the log-likelihood. The Viterbi algorithm$^{24}$ is used to find this maximum log-likelihood alignment. The illustration and the algorithm of the monotonic alignment search are shown as follows:
For more detail, please refer to the paper “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search”$^{25}$.
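The dynamic program behind the monotonic alignment search can be sketched as below. It is a minimal pure-Python version that assumes a precomputed table `log_p[i][j]`, the log-likelihood of mel frame `j` under the statistics of text position `i`; the function names are hypothetical and the real implementation is vectorized.

```python
def monotonic_alignment_search(log_p):
    """Return A(j) for every mel frame j, maximizing the summed log_p."""
    t_text, t_mel = len(log_p), len(log_p[0])
    if t_mel < t_text:
        raise ValueError("need at least one mel frame per text position")
    neg_inf = float("-inf")
    q = [[neg_inf] * t_mel for _ in range(t_text)]
    q[0][0] = log_p[0][0]
    for j in range(1, t_mel):
        for i in range(min(j + 1, t_text)):
            stay = q[i][j - 1]  # frame j keeps the same text position
            move = q[i - 1][j - 1] if i > 0 else neg_inf  # advance by one
            q[i][j] = max(stay, move) + log_p[i][j]
    # Backtrack from the last text position at the last frame.
    alignment = [0] * t_mel
    i = t_text - 1
    for j in range(t_mel - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return alignment


def durations_from_alignment(alignment, t_text):
    """Count how many frames were assigned to each text position."""
    counts = [0] * t_text
    for i in alignment:
        counts[i] += 1
    return counts
```

The alignment never skips a text position and never moves backwards, and the resulting frame counts are the duration targets for the duration predictor.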
## Vocoder

The vocoder converts the intermediate representation from the generator model, i.e., the log-mel spectrogram, into a speech waveform. HiFi-GAN was used as the vocoder because of its fast computation, low computational resource requirements, and speech synthesis quality comparable to human speech relative to other models. HiFi-GAN is based on the GAN (Generative Adversarial Network$^{26}$), in which a generative model and a discriminative model are trained adversarially. The generative model generates the synthesized result, and the discriminative model classifies whether its input is synthesized or real. HiFi-GAN uses two discriminators.
The generative model takes the log-mel spectrogram as input and upsamples the features into a speech waveform using transposed convolutions. Each transposed convolution block is followed by a multi-receptive field fusion (MRF) block. The MRF computes features over multiple receptive field lengths, i.e., different kernel sizes and dilation rates, in parallel and returns the sum of those outputs. The MRF design is based on ResNet$^{27}$. The overall generative model architecture is shown in the figure below:
As the speech signal can vary in phase, the discriminator needs to account for this effect. The multi-period discriminator (MPD) consists of several sub-discriminators, each handling a portion of the periodic signals of the input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in MelGAN$^{28}$ is used. MPD is a mixture of sub-discriminators, each of which only accepts equally spaced samples of the input audio. The sub-discriminators are designed to capture different implicit structures from each other by looking at different parts of the input audio. The building block of MPD is a 2D convolutional block, where the 1D audio signal is reshaped into 2D, allowing the model to compute gradients over temporal features. While MPD operates on disjoint samples of the speech, MSD operates on the whole signal. MSD is a mixture of three sub-discriminators operating on different input scales. The overall discriminative model architecture is shown in the figure below:
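The 1D-to-2D reshaping used by MPD can be illustrated with a short sketch: the waveform is padded to a multiple of the period and cut into rows, so each column holds equally spaced samples. The function name is hypothetical and the zero padding is a simplifying assumption of this sketch.

```python
def reshape_by_period(signal, period):
    """Zero-pad a 1D signal and reshape it into rows of length `period`."""
    remainder = len(signal) % period
    padded = list(signal) + [0.0] * ((period - remainder) % period)
    return [padded[start:start + period] for start in range(0, len(padded), period)]


def column(rows, index):
    """Samples index, index + period, index + 2 * period, ..."""
    return [row[index] for row in rows]
```

With period 3, column 0 of the reshaped signal contains samples 0, 3, 6, and so on, which is the equally spaced view that one sub-discriminator processes.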
The total training loss is computed from the following terms. The adversarial losses are:
$\mathcal{L}_{Adv}(D;G)=\mathbb{E}_{(x,s)}\left[(D(x)-1)^2+(D(G(s)))^2\right]$
$\mathcal{L}_{Adv}(G;D)=\mathbb{E}_{s}\left[(D(G(s))-1)^2\right]$
where $x$ denotes the ground truth audio and $s$ denotes the input condition, the mel-spectrogram of the ground truth audio. $D$ and $G$ are the discriminator and generator models.
The mel-spectrogram loss $\mathcal{L}_{Mel}$ is a reconstruction loss in which the mel-spectrogram is recomputed from the speech synthesized by the generative model. The computation is as follows:
$\mathcal{L}_{Mel}(G)=\mathbb{E}_{(x,s)}\left[\vert\vert\phi(x)-\phi(G(s))\vert\vert_1\right]$
where $\phi$ is the function that transforms a waveform into the corresponding mel-spectrogram.
The feature matching loss is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample. Every intermediate feature of the discriminator is extracted, and the L1 distance between a ground truth sample and a conditionally generated sample in each feature space is calculated. The computation is as follows:
$\mathcal{L}_{FM}(G;D)=\mathbb{E}_{(x,s)}\left[\sum_{i=1}^T\frac{1}{N_i}\vert\vert D^i(x)-D^i(G(s))\vert\vert_1\right]$
where $T$ denotes the number of layers in the discriminator; $D^i$ and $N_i$ denote the features and the number of features in the i-th layer of the discriminator, respectively.
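The loss terms above can be written out for small lists of numbers, as in the sketch below. The expectations are approximated by averages over the given samples, the inputs are plain Python lists rather than tensors, and all function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_adv_loss(d_real, d_fake):
    """L_Adv(D; G): push D(x) towards 1 and D(G(s)) towards 0."""
    return mean([(r - 1.0) ** 2 for r in d_real]) + mean([f ** 2 for f in d_fake])


def generator_adv_loss(d_fake):
    """L_Adv(G; D): push D(G(s)) towards 1."""
    return mean([(f - 1.0) ** 2 for f in d_fake])


def mel_loss(mel_real, mel_fake):
    """L_Mel(G): L1 norm between flattened mel-spectrograms."""
    return sum(abs(r - f) for r, f in zip(mel_real, mel_fake))


def feature_matching_loss(feats_real, feats_fake):
    """L_FM(G; D): per-layer L1 distance divided by N_i, summed over layers."""
    return sum(
        sum(abs(r - f) for r, f in zip(layer_real, layer_fake)) / len(layer_real)
        for layer_real, layer_fake in zip(feats_real, feats_fake)
    )
```

A discriminator that outputs 1 for real audio and 0 for generated audio has zero adversarial loss, while the generator's adversarial loss is zero only when the discriminator outputs 1 for generated audio.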

## Dataset

Three datasets were used. Our own dataset is a Thai audiobook dataset from Benyalai Library$^{29}$, consisting of 150 audiobooks, or more than 900 hours. The transcription of the audiobook dataset was prepared by Vulcan Hero using our Vulcan Collaboratory Platform. The English datasets are public datasets: 1) LJSpeech$^{30}$, 2) VCTK$^{31}$, and 3) LibriTTS$^{32}$ 100, 300, and 500. The total duration of this dataset is more than 1300 hours.

## Experiment Setup

Our model was trained with a batch size of 48, using the AdamW optimizer$^{33}$ with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999. The learning rate was set to 0.0001 and after 600k iterations. The gradient norm was clipped to 1.0.

## Model Hyperparameters

| Hyperparameters | VulcanTTS |
| --- | --- |
| Vocab sizes | 192 |
| Embedding dim | 256 |
| Prenet hidden dim | 256 |
| Text encoder dim | 128 |
| Speaker embedding dim | 256 |
| Speaker encoder projection dim | 128 |
|  | 2 |
| Transformer text encoder layers | 6 |
| Transformer feature encoder dim | 256 |
|  | 2 |
| Transformer feature encoder layers | 4 |
| Duration hidden dim | 256 |
| Duration kernel size | 3 |
| Duration layers | 3 |
| Duration projection dim | 1 |
| Output features (mel-bin) | 80 |
| Decoder dim | 128 |
| Speaker decoder projection dim | 128 |
| Transformer decoder dim | 256 |
|  | 2 |

### References

$^{15}$ Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
$^{16}$ https://as-ideas.github.io/DeepPhonemizer/
$^{17}$ Meng, Qiang, et al. "MagFace: A universal representation for face recognition and quality assessment." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
$^{18}$ Deng, Jiankang, et al. "ArcFace: Additive angular margin loss for deep face recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
$^{19}$ Snell, Jake, Kevin Swersky, and Richard S. Zemel. "Prototypical networks for few-shot learning." arXiv preprint arXiv:1703.05175 (2017).
$^{20}$ Wan, Li, et al. "Generalized end-to-end loss for speaker verification." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
$^{21}$ Cai, Weicheng, Jinkun Chen, and Ming Li. "Analysis of length normalization in end-to-end speaker verification system." arXiv preprint arXiv:1806.03209 (2018).
$^{22}$ Elias, Isaac, et al. "Parallel Tacotron: Non-autoregressive and controllable TTS." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
$^{23}$ https://en.wikipedia.org/wiki/Huber_loss
$^{24}$ https://en.wikipedia.org/wiki/Viterbi_algorithm
$^{25}$ Kim, Jaehyeon, et al. "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search." Advances in Neural Information Processing Systems 33 (2020): 8067-8077.
$^{26}$ Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems 27 (2014).
$^{27}$ He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
$^{28}$ Kumar, Kundan, et al. "MelGAN: Generative adversarial networks for conditional waveform synthesis." Advances in Neural Information Processing Systems 32 (2019).
$^{29}$ https://www.benyalai.in.th/
$^{30}$ https://keithito.com/LJ-Speech-Dataset/
$^{31}$ https://datashare.ed.ac.uk/handle/10283/2950
$^{32}$
$^{33}$