Speech Synthesis Essay


Speech Synthesis

Speech synthesis is the process of generating speech from a symbolic linguistic representation, such as text. Text-to-speech (TTS) synthesis systems can be divided into two broad categories:

Rule-based techniques

Data-driven approaches

These are discussed in detail in the following subsections.

Rule-based techniques

Rule-based techniques synthesize speech from a fixed set of rigid rules, mostly describing how the vocal system behaves during the production of specific phonemes. They usually do not use recorded human speech data. The two major rule-based techniques are:

Formant Synthesis

Articulatory Synthesis
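To make the rule-based idea concrete, here is a minimal formant synthesizer in Python, offered as an illustrative sketch rather than a complete system. It assumes a simple source-filter view: an impulse train at the fundamental frequency stands in for the glottal source, and a cascade of Klatt-style second-order resonators, one per formant, stands in for the vocal tract. The formant table `VOWEL_FORMANTS`, the single shared bandwidth, and the function names are assumptions made for illustration; the frequencies are approximate values for adult male vowels.

```python
import math

# Hand-written "rules": approximate first three formant frequencies (Hz)
# for three vowels of an adult male voice. Illustrative values only.
VOWEL_FORMANTS = {
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}


def resonator(x, freq, bandwidth, fs):
    """Klatt-style second-order digital resonator with unity gain at 0 Hz."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(vowel, f0=120.0, duration=0.3, fs=16000, bandwidth=80.0):
    """Return a peak-normalized list of samples for a sustained vowel."""
    n = round(duration * fs)
    period = int(fs / f0)
    # Source: an impulse train at the fundamental frequency f0.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter: pass the source through one resonator per formant, in cascade.
    for formant in VOWEL_FORMANTS[vowel]:
        signal = resonator(signal, formant, bandwidth, fs)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]
```

For example, `synthesize_vowel("a")` returns 4800 samples (0.3 s at 16 kHz), which could be scaled to 16-bit integers and written out with Python's standard `wave` module. Note that switching vowels changes only the formant table, not the synthesis code: the "knowledge" lives in the rules rather than in recorded data.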

Formant Synthesis

This technique was widely popular in the 1980s. In formant synthesis, speech is treated as

Here, instead of storing individual phoneme recordings and mapping them to the phonemes found in the text, the system stores parametric models of phonemes in different contexts. Put simply, statistical parametric speech synthesis generates the average of a set of similar-sounding speech segments. [7]
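The "average of similar-sounding segments" idea can be sketched in a few lines of Python. This is a deliberately simplified illustration: each context-dependent model is just the mean parameter vector of the training segments sharing a context label, whereas real systems also model variances and share statistics across related contexts. The label format `"a|stressed"`, the two parameters (F0 in Hz, duration in ms), and the numbers are all hypothetical toy values.

```python
from collections import defaultdict


def train_context_means(segments):
    """Average the parameter vectors of all segments that share a context label.

    segments: iterable of (context_label, [param_1, param_2, ...]).
    Returns {context_label: mean parameter vector}.
    """
    grouped = defaultdict(list)
    for label, features in segments:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


# Hypothetical toy segments: (phoneme|context, [F0 in Hz, duration in ms]).
SEGMENTS = [
    ("a|stressed", [130.0, 120.0]),
    ("a|stressed", [126.0, 110.0]),
    ("a|unstressed", [110.0, 80.0]),
    ("a|unstressed", [106.0, 70.0]),
]

models = train_context_means(SEGMENTS)
```

At synthesis time, a phoneme in the context `"a|stressed"` would then be rendered from the stored average, here `[128.0, 115.0]`, rather than from any single recorded segment.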

In this approach, speech is decomposed into parameters: acoustic features, such as fundamental frequency, the shape of the waveform (spectral envelope), and aperiodic energy, and duration features related to contextual prosody. The text, in turn, is decomposed into linguistic features. During the training phase, a Hidden Markov Model (HMM) or a Deep Neural Network (DNN) learns to predict the acoustic and duration parameters from the linguistic features of the text. [8]
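The training step can be illustrated with a small Python sketch. To keep it short, it swaps in a plain linear model trained by gradient descent in place of the HMM or DNN the text describes; the learning problem has the same shape (linguistic features in, acoustic and duration parameters out). The three binary features, the rules in `toy_target`, and all numbers are hypothetical, chosen only so the result is easy to check.

```python
from itertools import product


def predict(W, b, x):
    """Linear model: one weight row and one bias per output parameter."""
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(b))]


def train_linear(X, Y, lr=0.1, epochs=2000):
    """Full-batch gradient descent on mean squared error."""
    n_in, n_out, m = len(X[0]), len(Y[0]), len(X)
    W = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        gW = [[0.0] * n_in for _ in range(n_out)]
        gb = [0.0] * n_out
        for x, y in zip(X, Y):
            for k, (p, t) in enumerate(zip(predict(W, b, x), y)):
                err = p - t
                gb[k] += err
                for j, xj in enumerate(x):
                    gW[k][j] += err * xj
        for k in range(n_out):
            b[k] -= lr * gb[k] / m
            for j in range(n_in):
                W[k][j] -= lr * gW[k][j] / m
    return W, b


def toy_target(x):
    """Hypothetical rules used only to generate toy training targets."""
    is_vowel, is_stressed, is_final = x
    f0 = 110 + 20 * is_stressed - 15 * is_final    # Hz
    duration = 60 + 40 * is_vowel + 50 * is_final  # ms
    return [f0, duration]


# Linguistic features: every combination of [is_vowel, is_stressed, is_phrase_final].
X = [list(x) for x in product([0, 1], repeat=3)]
Y = [toy_target(x) for x in X]
W, b = train_linear(X, Y)
```

After training, `predict(W, b, [1, 1, 1])` (a stressed, phrase-final vowel) should come out very close to `[115.0, 150.0]`, the F0 and duration implied by the toy rules. A real system would learn from features extracted from recorded speech and would use a far richer model.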

How Statistical Parametric Speech Synthesis Works

First, the text is broken down into phonemes, and a linguistic representation is created for each phoneme. The linguistic representation of a phoneme contains the phoneme itself and some information about its prosody in the current context. Models then generate parameters from each linguistic representation, and these parameters are later used to synthesize speech. Linguistic representation is discussed in more detail in section
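The steps above can be sketched as a small Python pipeline: text to phonemes, phonemes to per-phoneme linguistic representations, and representations to parameters. Everything here is a hypothetical stand-in: `LEXICON` is a two-word toy pronunciation dictionary using ARPAbet-style symbols, `LinguisticUnit` captures only neighbouring phonemes, stress, and phrase-final position, and `generate_parameters` replaces a trained model with fixed toy rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy lexicon: word -> list of (phoneme, is_stressed).
LEXICON = {
    "hello": [("HH", False), ("AH", False), ("L", False), ("OW", True)],
    "world": [("W", False), ("ER", True), ("L", False), ("D", False)],
}


@dataclass
class LinguisticUnit:
    phoneme: str
    prev_phoneme: Optional[str]
    next_phoneme: Optional[str]
    stressed: bool
    phrase_final: bool  # True for the last phoneme of the utterance


def text_to_units(text: str) -> List[LinguisticUnit]:
    """Break text into phonemes and attach simple contextual information."""
    phones = [p for word in text.lower().split() for p in LEXICON[word]]
    units = []
    for i, (phoneme, stressed) in enumerate(phones):
        units.append(LinguisticUnit(
            phoneme=phoneme,
            prev_phoneme=phones[i - 1][0] if i > 0 else None,
            next_phoneme=phones[i + 1][0] if i + 1 < len(phones) else None,
            stressed=stressed,
            phrase_final=(i == len(phones) - 1),
        ))
    return units


def generate_parameters(unit: LinguisticUnit) -> Tuple[float, float]:
    """Hypothetical stand-in for a trained model: returns (F0 in Hz, duration in ms)."""
    f0 = 100.0 + (20.0 if unit.stressed else 0.0) - (15.0 if unit.phrase_final else 0.0)
    duration = 70.0 + (40.0 if unit.stressed else 0.0) + (50.0 if unit.phrase_final else 0.0)
    return f0, duration


units = text_to_units("hello world")
params = [generate_parameters(u) for u in units]
```

Here the stressed `OW` in "hello" receives a higher F0 and longer duration than its neighbours, and the final `D` is lengthened because it ends the phrase. A real front end would also handle words missing from the lexicon, which this toy version does not.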
