Artificial simulation of human speech using computers or other devices
The simplest method of speech synthesis has a machine analyze the words of an input phrase and group letters according to how they are commonly used together. Each group of letters is then matched to a specific sound in the machine's database, and the sounds are played in sequence to create the synthesized audio. Because this approach merely converts the most common sound of each letter group into audio, without regard to the surrounding word or sentence, it produces the uneven, robotic tones and odd mispronunciations typical of simpler systems.
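The letter-grouping approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a real system: the SOUND_TABLE entries, the sound labels, and the function name letters_to_sounds are all hypothetical, and a real database would be far larger.

```python
# Hypothetical letter-group-to-sound table (illustrative only).
SOUND_TABLE = {
    # common letter groups
    "th": "TH", "sh": "SH", "ch": "CH", "ee": "IY", "oo": "UW",
    # single letters, each mapped to its most common sound
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
    "n": "N", "o": "AA", "p": "P", "r": "R", "s": "S", "t": "T",
    "u": "AH",
}

MAX_GROUP = max(len(group) for group in SOUND_TABLE)


def letters_to_sounds(word):
    """Greedily match the longest known letter group at each position."""
    word = word.lower()
    sounds = []
    i = 0
    while i < len(word):
        for size in range(MAX_GROUP, 0, -1):
            chunk = word[i:i + size]
            if chunk in SOUND_TABLE:
                sounds.append(SOUND_TABLE[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no entry for this character: skip it
    return sounds


print(letters_to_sounds("sheep"))   # ['SH', 'IY', 'P']
print(letters_to_sounds("though"))  # ['TH', 'AA', 'AH', 'G', 'HH']
```

The second call shows the weakness of the method: with no knowledge of the whole word, "though" is sounded out letter by letter, which is the kind of mispronunciation simpler systems produce.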
To produce smoother, more natural speech patterns, more advanced speech synthesis systems use hidden Markov models (HMMs) to determine the most likely way a phrase should be "spoken" by the synthesizer. A hidden Markov model is a probabilistic state machine: it moves between a finite set of states that cannot be observed directly, and it can be used to analyze text that has been broken down into a time-ordered series of segments. Using phonetic analysis and probability, the model determines which word has actually been typed and its place within the typed phrase. This allows the machine to string sounds together at a more natural pace, so that the audio produced matches the intent of the text.
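As a rough illustration of how a hidden Markov model picks the most probable reading of a typed phrase, the sketch below runs the Viterbi algorithm over a toy model. Everything in it is an assumption made for the example: the four hidden states, the probability tables, the pronunciation labels, and the function names viterbi and pronounce. It resolves only whether "record" is a noun or a verb, which changes how the word is spoken.

```python
# Toy hidden Markov model; all states and probabilities are hypothetical.
STATES = ["DET", "PRON", "NOUN", "VERB"]

START = {"DET": 0.5, "PRON": 0.3, "NOUN": 0.1, "VERB": 0.1}

# TRANS[a][b] = probability of moving from state a to state b
TRANS = {
    "DET":  {"DET": 0.0, "PRON": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "PRON": {"DET": 0.0, "PRON": 0.0, "NOUN": 0.1, "VERB": 0.9},
    "NOUN": {"DET": 0.2, "PRON": 0.2, "NOUN": 0.1, "VERB": 0.5},
    "VERB": {"DET": 0.5, "PRON": 0.2, "NOUN": 0.2, "VERB": 0.1},
}

# EMIT[s][w] = probability that state s produces the typed word w
EMIT = {
    "DET":  {"the": 1.0},
    "PRON": {"they": 1.0},
    "NOUN": {"record": 1.0},
    "VERB": {"record": 1.0},
}

# Hypothetical pronunciations that depend on the hidden state.
PRONUNCIATIONS = {("record", "NOUN"): "REH-kerd", ("record", "VERB"): "rih-KORD"}


def viterbi(words):
    """Return the most probable state sequence for a non-empty word list."""
    # best[t][s] = probability of the best path that ends in state s at step t
    best = [{s: START[s] * EMIT[s].get(words[0], 0.0) for s in STATES}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: best[t - 1][p] * TRANS[p][s])
            best[t][s] = best[t - 1][prev] * TRANS[prev][s] * EMIT[s].get(words[t], 0.0)
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def pronounce(words):
    """Pick a pronunciation for each word from its most probable state."""
    return [PRONUNCIATIONS.get((w, s), w) for w, s in zip(words, viterbi(words))]


print(pronounce(["the", "record"]))   # ['the', 'REH-kerd']
print(pronounce(["they", "record"]))  # ['they', 'rih-KORD']
```

The same typed word receives a different pronunciation depending on its most probable role in the phrase, which a fixed letter-to-sound lookup cannot do.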
The four stages of analysis used to produce audio with hidden Markov models are text, phonetic, prosodic, and speech. Text analysis converts the text into a form usable by the machine and uses probability to determine the text's linguistic meaning and context. Phonetic analysis converts the typed letters into phonetic symbols that the machine can relate to particular sounds. Prosodic analysis uses the linguistic meaning, together with the context and the phonetic symbols, to determine the most probable rhythm, stress patterns, and intonation. Speech analysis combines the results of the previous stages to generate the speech signal.
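The flow of data through the four stages can be sketched as a simple pipeline. This is a structural illustration only, under loud assumptions: the LEXICON entries, the fixed 80 ms duration, the pitch values, and all function names are hypothetical; the stages use fixed rules rather than a trained probabilistic model; and plain sine tones stand in for real speech sounds.

```python
import math

# Hypothetical pronunciation lexicon (illustrative only).
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}


def text_analysis(text):
    """Stage 1: normalize the text and note simple context (is it a question?)."""
    is_question = text.strip().endswith("?")
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w], is_question


def phonetic_analysis(words):
    """Stage 2: convert words into phonetic symbols."""
    phones = []
    for word in words:
        # Unknown words fall back to spelling out their letters.
        phones.extend(LEXICON.get(word, list(word.upper())))
    return phones


def prosodic_analysis(phones, is_question):
    """Stage 3: assign a duration (ms) and pitch (Hz) to each symbol."""
    plan = []
    for i, phone in enumerate(phones):
        position = i / max(len(phones) - 1, 1)  # 0.0 at the start, 1.0 at the end
        # Pitch rises toward the end of a question and falls otherwise.
        pitch = 120.0 + (40.0 if is_question else -20.0) * position
        plan.append((phone, 80, pitch))
    return plan


def speech_generation(plan, sample_rate=8000):
    """Stage 4: combine the earlier results into a signal (sine-tone placeholder)."""
    samples = []
    for _phone, duration_ms, pitch in plan:
        count = sample_rate * duration_ms // 1000
        samples.extend(
            math.sin(2 * math.pi * pitch * k / sample_rate) for k in range(count)
        )
    return samples


def synthesize(text):
    words, is_question = text_analysis(text)
    phones = phonetic_analysis(words)
    plan = prosodic_analysis(phones, is_question)
    return speech_generation(plan)


signal = synthesize("Hello, world?")
print(len(signal))  # 5120 samples: 8 symbols x 640 samples each
```

Each stage consumes only the output of the stages before it, mirroring the order text, phonetic, prosodic, speech described above.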
