Text-to-speech (TTS) is the artificial production of human speech based on input text. TTS can be viewed as a sequence-to-sequence mapping where text gets transformed into one of its many possible speech forms.
However, this process is not as straightforward as it might seem. Anyone who used early sat nav systems will recall numerous instances where the technology mangled pronunciations, often with embarrassing results. Nuances in pronunciation become particularly important when dealing with delicate news material or financial research.
The main challenge of perfecting TTS lies in the nature of text itself, as it's inherently ambiguous and under-specified with regards to many aspects of speech.
In this post, we outline the key challenges presented in processing input text and describe how a text preprocessor can help.
Ambiguity in written language
We can roughly categorize text into natural language and non-standard words (NSWs). Loosely speaking, natural language is any text that can be read out loud immediately, whereas NSWs have to be verbalized first.
For example "April first" can be read out loud as is, whereas "1st April" has to be converted into its natural language form before it can be read out loud.
There are many types of NSWs that need conversion, such as dates ("01/01/2001" into "January the first two thousand and one"), currencies ("$100" into "one hundred dollars") or units ("10m" into "10 meters" or "10 million"), just to name a few.
But even text without NSWs might require some form of disambiguation. For example, abbreviations need to be expanded into their natural form, like "St." into "Street" or "Saint". Other examples are acronyms and letter sequences, which can be read as words (such as "NASA") or by pronouncing each letter separately (such as "FBI").
The process of disambiguating and expanding natural language and NSWs is commonly referred to as text normalization. Its output can be interpreted as a sequence of graphemes: the letters and symbols that represent sounds in a written language.
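To make the idea concrete, here is a minimal, purely illustrative sketch of rule-based text normalization. The function names, patterns, and spell-outs are hypothetical; production systems typically rely on far richer grammars (for example, weighted finite-state transducers) and context models rather than a handful of regular expressions.

```python
import re

# Toy number-to-words conversion, handling 0-99 only for illustration.
_ONES = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
_TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
          "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 10:
        return _ONES[n]
    if n < 20:
        return _TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return _TENS[tens] + ("" if ones == 0 else " " + _ONES[ones])

def normalize(text: str) -> str:
    # Expand currency: "$42" -> "forty two dollars".
    text = re.sub(r"\$(\d+)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)
    # Expand units: "10m" -> "ten meters". Note the ambiguity the post
    # describes: without context, "10m" could equally mean "ten million".
    text = re.sub(r"\b(\d+)m\b",
                  lambda m: number_to_words(int(m.group(1))) + " meters",
                  text)
    return text

print(normalize("The race is $42 and 10m long."))
# -> "The race is forty two dollars and ten meters long."
```

Even this tiny example shows why normalization is hard: the rule for "10m" had to pick one reading ("meters") over another ("million") with no way to know which was intended.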
Ambiguity with respect to pronunciation
The process of normalizing text into a sequence of graphemes removes some of the ambiguity, though not all of it.
Take heteronyms - words that are spelled the same but pronounced differently depending on the context they appear in. For example, "lead" can refer to being in charge, in which case it's pronounced like "leed". However, "lead" can also refer to the metal, in which case it is pronounced like "led".
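One common way to disambiguate heteronyms is to condition the pronunciation on the word's part of speech in context. The sketch below is a hypothetical illustration: the lookup table, tag names, and respellings are invented for clarity, and real systems use a trained part-of-speech tagger together with a phoneme inventory such as IPA or ARPAbet rather than ad-hoc respellings.

```python
# Toy heteronym disambiguation keyed on (word, part-of-speech).
# Respellings like "leed"/"led" stand in for proper phonemic transcriptions.
HETERONYMS = {
    ("lead", "VERB"): "leed",  # to lead a team
    ("lead", "NOUN"): "led",   # the metal
    ("read", "VERB"): "reed",  # present tense
    ("read", "VERB_PAST"): "red",
}

def pronounce(word: str, pos: str) -> str:
    # Fall back to the written form when no heteronym entry applies.
    return HETERONYMS.get((word.lower(), pos), word)

print(pronounce("lead", "NOUN"))  # -> "led"
print(pronounce("lead", "VERB"))  # -> "leed"
```

Note that part of speech alone is not always enough: "lead" as a noun can still mean either the metal or a leading position, so some cases need word-sense disambiguation beyond tagging.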
Pronunciation can also vary based on regional or personal preferences. For instance, the word "either" can be pronounced as "ee-thur" or "eye-thur". Or consider the brand name "Nike", which in the US is commonly pronounced "nai-kee", whereas in other parts of the world people might say "nyke".
Furthermore, unknown words can appear over time, such as "gif" a few decades ago. Some people use a hard "g" (like "gift"), while others use a soft "g" (like "giraffe"). Sometimes these new words deviate from existing orthographic rules, which makes inferring their pronunciation challenging.
The examples above motivate the need to better specify the intended pronunciation of input text. This is commonly achieved with phonemic transcription, a process which transcribes the grapheme sequence into a sequence of phonemes.
Phonemes are the smallest units of sound that can distinguish one word from another, and are represented by special alphabets such as the International Phonetic Alphabet (IPA). With its hundreds of symbols, the IPA allows for more granular control than grapheme-based input alone.
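Phonemic transcription is often implemented as a lexicon lookup with a fallback model for unseen words. The sketch below assumes a tiny hand-written IPA lexicon (the entries and the variant-selection scheme are illustrative, not a real system); production pipelines combine a large pronunciation dictionary with a trained grapheme-to-phoneme model for out-of-vocabulary words.

```python
# Minimal lexicon lookup for phonemic transcription, with variant
# pronunciations capturing the regional/personal differences above.
LEXICON = {
    "either": ["ˈiːðər", "ˈaɪðər"],  # "ee-thur" vs "eye-thur"
    "nike":   ["ˈnaɪki", "naɪk"],    # US vs elsewhere
    "gif":    ["ɡɪf", "dʒɪf"],       # hard vs soft "g"
}

def transcribe(word: str, variant: int = 0) -> str:
    variants = LEXICON.get(word.lower())
    if variants is None:
        # A real system would fall back to a trained G2P model here.
        raise KeyError(f"{word!r} not in lexicon")
    # Clamp the requested variant to what the lexicon actually has.
    return variants[min(variant, len(variants) - 1)]

print(transcribe("either"))     # -> "ˈiːðər"
print(transcribe("either", 1))  # -> "ˈaɪðər"
```

Exposing the variant index is one simple way to let users pick between valid pronunciations instead of having the system silently choose one.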
Ambiguity with respect to various aspects of speech
Even after text normalization and phonemic transcription, a reader may still have little information about many aspects of speech, such as timbre, which characterizes who the speaker is (e.g. their age) or where they come from (e.g. their regional accent or dialect).
It is often unclear how a given text should be read out loud in terms...