Solution
Audio book development using artificial intelligence is a revolutionary technology that can help e-book publishers lower production costs and shorten time to market by more than 90%. We created a Text-to-Speech (TTS) AI model for the Thai natural language. In general, the quality of the AI voice is close to that of a human reading voice.
Our methodology is to train an AI model on millions of pairs of text and speech in order to teach it how to synthesize Thai speech. The AI is trained to mimic a genuine Thai voice as closely as possible.
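For a flavor of what this training step looks like, below is a minimal sketch in PyTorch. The toy model, the alphabet size, and the randomly generated batch are illustrative assumptions rather than our production setup; a real system would use an attention-based architecture such as Tacotron 2 trained on the aligned Thai corpus.

```python
# Minimal sketch of text-to-spectrogram training (toy setup, not production).
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy model: character ids -> GRU encoder -> mel spectrogram frames."""
    def __init__(self, vocab_size=100, emb_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, char_ids):
        x = self.embed(char_ids)   # (batch, chars, emb_dim)
        h, _ = self.encoder(x)     # (batch, chars, hidden)
        return self.to_mel(h)      # (batch, chars, n_mels)

model = TinyTTS()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Each training pair is (text as character ids, target mel spectrogram).
# Here one batch is faked; in practice it comes from the aligned corpus.
chars = torch.randint(0, 100, (8, 50))   # 8 sentences, 50 characters each
target_mel = torch.randn(8, 50, 80)      # matching mel frames (assumed aligned)

for step in range(3):                    # a real run trains for many epochs
    pred_mel = model(chars)
    loss = loss_fn(pred_mel, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```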
After the TTS model is trained, we have it read the script of our e-book, which is in a machine-readable format (ePub, plain text). Then, using our own technology, we capture the AI's voice in the form of an audio book. This significantly reduces the production time of a single audio book from weeks to minutes.
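The pipeline can be pictured as in the sketch below. The ePub handling uses the real ebooklib and BeautifulSoup APIs, but `ThaiTTS` is a hypothetical stand-in for the trained model, and chapter handling is simplified for illustration.

```python
# Sketch of the ebook-to-audiobook pipeline (ThaiTTS is a placeholder).
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import numpy as np
import soundfile as sf

class ThaiTTS:
    """Hypothetical interface of the trained Thai TTS model."""
    def synthesize(self, text: str, sample_rate: int = 22050) -> np.ndarray:
        # The real model returns synthesized Thai speech; this stub returns
        # one second of silence so the pipeline stays runnable.
        return np.zeros(sample_rate, dtype="float32")

def epub_to_audiobook(epub_path: str, out_path: str, sample_rate: int = 22050):
    book = epub.read_epub(epub_path)
    tts = ThaiTTS()
    chunks = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Strip HTML markup so the model sees plain chapter text.
        text = BeautifulSoup(item.get_content(), "html.parser").get_text()
        if text.strip():
            chunks.append(tts.synthesize(text, sample_rate))
    sf.write(out_path, np.concatenate(chunks), sample_rate)

epub_to_audiobook("book.epub", "book.wav")
```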
The majority of the production cost goes into training the AI; once it reads well enough, we no longer require human voice actors for audio book creation.
Despite the advent of Vaja, a Thai-voice text-to-speech (TTS) system, Thai TTS is not yet usable in broad applications. Although Vaja has demonstrated remarkable quality for general speech, it fails to address the specific needs of individual domains. Moreover, its infrequent updates do not keep pace with the fast-moving field of deep learning, where the state of the art has passed from hidden Markov models to LSTM, WaveNet, and Tacotron 2 (as of 2019) in just over two years. This is what prompted us to look at the Thai TTS problem again, despite many claims that it has been solved.
Perhaps what holds it back is the inherent difference between Thai and English speech. WaveNet, while highly remarkable in English, depends mostly on acoustic and linguistic features that require well-trained linguists to extract. This complication in data preparation makes it hard to provide a sufficient amount of data to feed the ravenousness of these deep learning models, especially for Thai.
Even in English, the issue persists. Things would be simpler if we could condition the model on the text directly, and this was exactly the idea behind Deep Voice 3, Char2Wav, and Tacotron: models that automatically extract acoustic features from text, without human intervention. From those features, standard acoustic algorithms convert the output back into a waveform. This follows good deep learning practice: "state what you want and let the algorithm derive the features for you." These models were able to produce high-quality sound; the only problem was that they were not as good as WaveNet.
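The "standard acoustic algorithm" step mentioned above typically means Griffin-Lim phase reconstruction. The sketch below shows that step with librosa; the mel spectrogram is computed from a bundled example file only so the snippet runs end to end, whereas in the actual pipeline it would come from the text-to-spectrogram model.

```python
# Invert a mel spectrogram back to audio via Griffin-Lim (librosa).
import librosa
import soundfile as sf

sr = 22050
# Stand-in audio so the example is self-contained; the real pipeline
# would start from the model's predicted mel spectrogram instead.
y, _ = librosa.load(librosa.example("trumpet"), sr=sr)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# mel -> approximate linear spectrogram -> phase via Griffin-Lim -> waveform
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=32)
sf.write("reconstructed.wav", y_hat, sr)
```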
Text-to-Speech AI

The average consumer has less time to consume information. As data transfer over the Internet became faster, "listening" grew popular as an alternative to "reading." According to statistics, audio book sales have increased dramatically in recent years.
The number of audio book publishers in Thailand is fairly low. According to 2021 data, our online store carries only roughly 1,000 audio books, which is less than 1% of all books sold in the market.
According to Statista, 79,000 audio books were released in 2016 in the United States, the country with the most audio book listeners in the world; that is 34% more than the number of general e-books published.
Our audio book business model is fairly simple: we partner with book publishers in Thailand and produce audio books using our TTS model. We devised the following revenue sharing scheme:
20% to Vulcan
80% to book publishers
We also designed an audio book revenue model in the form of a subscription, which allows monthly subscribers to listen to our audio books without limit. However, we must still study the legal constraints on revenue sharing among the different parties.
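As a toy illustration of this scheme, the snippet below computes the 80/20 split for a one-off sale and for a subscription month. The prices and the per-title pro-rating rule are assumptions made up for the example, not contractual terms.

```python
# Toy calculation of the revenue sharing scheme (illustrative numbers).
VULCAN_SHARE = 0.20
PUBLISHER_SHARE = 0.80

def split_sale(price: float) -> dict:
    """Split a single audio book sale between Vulcan and the publisher."""
    return {"vulcan": price * VULCAN_SHARE, "publisher": price * PUBLISHER_SHARE}

def split_subscription(monthly_fee: float, titles_listened: int) -> dict:
    """Assumed rule: pro-rate the fee evenly over titles listened that month."""
    per_title = monthly_fee / titles_listened
    return {"per_title": split_sale(per_title), "titles": titles_listened}

print(split_sale(350.0))              # one-off sale, e.g. 350 THB
print(split_subscription(199.0, 4))   # one subscriber, four titles that month
```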
Low production cost
Fast time to market
Social innovation
Advanced audio book platform
http://www.vajatts.com/overview
https://en.wikipedia.org/wiki/Hidden_Markov_model
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Saurous, R. A. (2018, April). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4779-4783). IEEE.
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., ... & Miller, J. (2017). Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.
Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., & Bengio, Y. (2017). Char2Wav: End-to-end speech synthesis. ICLR 2017 Workshop Track.
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Le, Q. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.