Development of devices and algorithms for voice synthesis

One of the first attempts to create synthetic speech was made more than two hundred years ago, in 1779, by the Prussian professor Christian Kratzenstein. In his St. Petersburg laboratory the inventor built a set of acoustic resonators, driven by vibrating reeds, that mimicked the human vocal apparatus. The device could artificially reproduce five long vowel sounds.

Twelve years later, in 1791, an inventor working in Vienna, Wolfgang von Kempelen, built a more elaborate machine modeled on the human organs involved in speech production. It had a pair of bellows to simulate the lungs, a vibrating reed acting as the vocal cords, a leather tube for the vocal tract, two nostrils, and leather tongue and lips. By controlling the shape of the tube and the positions of the tongue and lips, von Kempelen could produce both consonants and vowels.

Nearly half a century later, in 1838, the English scientist Robert Willis established a connection between individual vowel sounds and the geometry of the human vocal tract. His work inspired other researchers to build speech-synthesizing devices; one such device was created by Alexander Graham Bell and his father in the late 19th century.

At the 1939 World’s Fair in New York, Homer Dudley presented the world’s first electronic speech synthesizer, the VODER (Voice Operating Demonstrator). The VODER worked on the same principle as many modern devices: the source-filter model of speech. That said, the intelligibility of the speech it produced left much to be desired.

Voice Operating Demonstrator

The operator selected one of two basic sound sources with a wrist bar: a buzzing tone and a hissing noise. The buzz served as the basis for vowels and nasal sounds, while the hiss was used for the noisy components of consonants. The chosen source was then passed through a bank of filters selected with keys on a keyboard, and the resulting sound was combined and played through a loudspeaker.

Additional keys triggered the transient excitation needed for sounds that buzzing or hissing alone could not reproduce, such as the stops “p” and “d”. By manipulating the keys and sources, the operator could string words together into sentences, while a foot pedal controlled pitch, letting the operator vary the intonation, for example to mark a question.
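The source-filter idea behind the VODER is easy to illustrate in code. The sketch below is a minimal illustration, not a reconstruction of the VODER itself: it assumes Python with NumPy and SciPy, uses a periodic “buzz” or noisy “hiss” excitation, and passes it through a few second-order resonators as a crude stand-in for the VODER’s keyboard-controlled filter bank. The formant frequencies are illustrative values for an “ah”-like vowel, not the machine’s actual settings.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

SR = 16000  # sample rate, Hz


def source(kind, duration, f0=110.0):
    """Generate the excitation: a periodic 'buzz' or a noisy 'hiss'."""
    n = int(SR * duration)
    if kind == "buzz":
        # Impulse train at the fundamental frequency (vocal-fold analogue).
        sig = np.zeros(n)
        sig[::int(SR / f0)] = 1.0
    else:
        # White noise (turbulence analogue for hissing sounds).
        sig = np.random.randn(n) * 0.1
    return sig


def resonator(signal, freq, bandwidth):
    """Second-order IIR resonator approximating one formant filter."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * freq / SR
    a = [1.0, -2 * r * np.cos(theta), r * r]  # poles at the formant frequency
    b = [1.0 - r]                             # rough gain normalisation
    return lfilter(b, a, signal)


def synthesize(kind="buzz", duration=1.0,
               formants=((730, 90), (1090, 110), (2440, 170))):
    """Pass the chosen source through a small bank of formant resonators."""
    excitation = source(kind, duration)
    out = np.zeros_like(excitation)
    for freq, bw in formants:
        out += resonator(excitation, freq, bw)
    return out / np.max(np.abs(out))          # normalise to [-1, 1]


if __name__ == "__main__":
    vowel = synthesize("buzz")   # vowel-like tone
    hiss = synthesize("hiss")    # fricative-like noise
    wavfile.write("vowel.wav", SR, (vowel * 32767).astype(np.int16))
    wavfile.write("hiss.wav", SR, (hiss * 32767).astype(np.int16))
```

Switching the excitation while keeping the same filters is exactly the trick the VODER operator performed with the wrist bar: the filters shape the spectrum, while the source decides whether the result sounds voiced or noisy.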

Then came the era of formant synthesizers such as PAT (Parametric Artificial Talker) and OVE (Orator Verbis Electris), as well as early articulatory synthesizers such as DAVO (Dynamic Analog of the VOcal tract).

The next stage in the development of speech generation came in 1968, when Noriko Umeda and colleagues in Japan built the world’s first full text-to-speech system for English. From that point on, machines were able to generate intelligible speech from arbitrary text.

In the 1980s and 1990s, hidden Markov models and neural networks came into wide use for speech synthesis. With these statistical tools, researchers sought to produce richer, more complex sounds, as close as possible to the human voice.

Modern voice synthesis technology continues to evolve through methods based on generative adversarial networks (GANs).

The economic benefits of artificial speech are obvious, for example in the production of video games and audiobooks. The main focus now is on making it easier to create multiple voices that sound distinct from one another. Virtual assistants such as Alexa and Siri still require large amounts of recorded data to build a customized voice. Algorithms are being developed both for reading text aloud with appropriate emotion and for generating speech with the timbre, tempo and other characteristics of specific speakers.