Speech synthesis is the artificial production of human speech by a computer or other device, known as a speech computer or speech synthesizer. It is the counterpart of voice recognition. Text-to-speech (TTS) systems convert written text into audio, and the technology is used in voice-enabled services and unified messaging.
It is also used in assistive technology to help vision-impaired individuals read text content: the contents of the display are automatically read aloud to the user.
Christian Kratzenstein, a German-born physicist and professor, was a pioneer of speech synthesis. In 1779 he built an apparatus modeled on the human vocal tract that could produce five long vowel sounds.
The VODER (Voice Operating Demonstrator), developed by Homer Dudley, was the first fully functional voice synthesizer and was shown at the 1939 World's Fair. It was based on Bell Laboratories' vocoder (voice coder) research of the mid-1930s.
The simplest method of speech synthesis has the machine analyze the words of the input phrase and group letters that commonly occur together. Each group of letters is then matched to a specific sound in the machine's database, and the matched sounds are joined to create the synthesized audio. In this version of speech synthesis, the machine merely converts the most common sound of each letter group into audio, which explains the uneven, robotic tone and odd mispronunciations of simpler systems.
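The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `SOUND_TABLE`, its sound labels, and the function `letters_to_sounds` are hypothetical, and a real system would use a far larger inventory and play recorded or generated audio instead of returning labels.

```python
# Hypothetical table mapping letter groups to sound labels (an assumption
# made for illustration; it is not taken from any real synthesizer).
SOUND_TABLE = {
    "ough": "AH F",  # most common guess, as in "tough"
    "th": "TH",
    "sh": "SH",
    "ee": "IY",
    "t": "T",
    "p": "P",
}
MAX_GROUP = max(len(group) for group in SOUND_TABLE)


def letters_to_sounds(word):
    """Greedily match the longest known letter group at each position."""
    word = word.lower()
    sounds = []
    i = 0
    while i < len(word):
        for size in range(min(MAX_GROUP, len(word) - i), 0, -1):
            group = word[i:i + size]
            if group in SOUND_TABLE:
                sounds.append(SOUND_TABLE[group])
                i += size
                break
        else:
            i += 1  # no entry for this letter: skip it
    return sounds


print(letters_to_sounds("tough"))   # ['T', 'AH F']
print(letters_to_sounds("though"))  # ['TH', 'AH F']
```

Because the table stores only one sound per letter group, "though" comes out sounding like "thuff", which is the kind of mispronunciation the paragraph above attributes to simpler systems.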
To produce smoother, more natural speech patterns, modern speech synthesis systems use hidden Markov models (HMMs) to determine the most likely phrase that needs to be "spoken" by the synthesizer. A hidden Markov model is a probabilistic state machine that can analyze text broken down into a time-ordered series of segments. Using phonetic analysis, the model estimates which word has been typed and, from the word's place within the phrase, the most probable way to say it. This allows the machine to string sounds together at a more natural pace, so that the audio produced matches the intent of the text.
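To make the role of probability concrete, here is a minimal sketch of Viterbi decoding on a toy hidden Markov model. It tackles a small, related task: choosing how to pronounce the ambiguous word "record" from its place in the phrase. The states, all probabilities, the `PRONUNCIATION` table, and the function name `viterbi` are assumptions invented for this example, not values from any real system.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the given words."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0), None)
             for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p][0] * trans_p[p].get(s, 0.0))
            score = best[-1][prev][0] * trans_p[prev].get(s, 0.0) * emit_p[s].get(word, 0.0)
            column[s] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model: every number below is made up for illustration.
STATES = ["PRON", "DET", "NOUN", "VERB"]
START_P = {"PRON": 0.4, "DET": 0.4, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "PRON": {"VERB": 0.8, "NOUN": 0.1, "DET": 0.05, "PRON": 0.05},
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "VERB": {"DET": 0.6, "NOUN": 0.2, "PRON": 0.1, "VERB": 0.1},
    "NOUN": {"VERB": 0.5, "NOUN": 0.2, "DET": 0.2, "PRON": 0.1},
}
EMIT_P = {
    "PRON": {"they": 1.0},
    "DET": {"the": 1.0},
    "NOUN": {"record": 1.0},
    "VERB": {"record": 1.0},
}
# Hypothetical pronunciations keyed by (word, decoded state).
PRONUNCIATION = {("record", "NOUN"): "REH-kurd", ("record", "VERB"): "rih-KORD"}

words = ["they", "record", "the", "record"]
tags = viterbi(words, STATES, START_P, TRANS_P, EMIT_P)
spoken = [PRONUNCIATION.get((w, t), w) for w, t in zip(words, tags)]
print(tags)    # ['PRON', 'VERB', 'DET', 'NOUN']
print(spoken)  # ['they', 'rih-KORD', 'the', 'REH-kurd']
```

The same spelling receives two different pronunciations because the decoder scores whole sequences of states instead of looking at each word in isolation. Multiplying raw probabilities is fine for a four-word phrase; longer inputs would normally use log probabilities to avoid underflow.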
The four stages of analysis used to produce audio with hidden Markov models are text, phonetic, prosodic, and speech. Text analysis converts the text into a form the machine can use and applies probability to determine the linguistic meaning and context of the text. Phonetic analysis converts the typed letters into phonetic symbols that the machine can relate to particular sounds. Prosodic analysis uses the linguistic meaning, together with the context and the phonetic symbols, to determine the most probable rhythm, stress patterns, and intonation. Speech analysis combines the results of the previous stages to generate the speech signal.
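The text, phonetic, prosodic, and speech steps above can be sketched as a chain of small functions. Everything here is a simplified assumption for illustration: the `LEXICON` entries, the fixed 80 ms duration, the 100 to 130 Hz pitch range, and the use of plain sine tones in place of real speech sounds. A real system would choose these values probabilistically instead of by fixed rules.

```python
import math
import re

SAMPLE_RATE = 8000  # samples per second (assumed)
# Hypothetical pronunciation lexicon with made-up coverage.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}


def text_analysis(text):
    """Normalize raw text into word tokens and note the sentence type."""
    kind = "question" if text.strip().endswith("?") else "statement"
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens, kind


def phonetic_analysis(tokens):
    """Convert tokens to phonetic symbols, spelling out unknown words."""
    phonemes = []
    for token in tokens:
        phonemes.extend(LEXICON.get(token, list(token.upper())))
    return phonemes


def prosodic_analysis(phonemes, kind):
    """Attach a duration and a pitch: rising for questions, falling otherwise."""
    start, end = (100.0, 130.0) if kind == "question" else (130.0, 100.0)
    steps = max(len(phonemes) - 1, 1)
    return [(p, 80, start + (end - start) * i / steps)
            for i, p in enumerate(phonemes)]


def speech_generation(plan):
    """Render each (phoneme, duration_ms, pitch_hz) entry as a sine tone."""
    samples = []
    for _phoneme, duration_ms, pitch_hz in plan:
        count = SAMPLE_RATE * duration_ms // 1000
        samples.extend(math.sin(2 * math.pi * pitch_hz * n / SAMPLE_RATE)
                       for n in range(count))
    return samples


def synthesize(text):
    tokens, kind = text_analysis(text)
    plan = prosodic_analysis(phonetic_analysis(tokens), kind)
    return plan, speech_generation(plan)


plan, signal = synthesize("Hello, world!")
print([p for p, _, _ in plan])  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(len(signal))              # 5120
```

Keeping each step as a separate function mirrors the description: each one consumes only the output of the step before it, so any single step could be replaced by a statistical model without touching the others.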