Homomorphic Deconvolution

Homomorphic deconvolution is used in signal processing as a method of separating components of a signal. It is a system that accepts a signal composed of two parts and returns a signal with one of those components removed. In speech processing, we can use homomorphic deconvolution to separate two components: the transfer function (filter) and the excitation signal (source). The purpose of deconvolving the speech signals we processed is to extract both the transfer function and the excitation signal for use in further processing.

Image Courtesy of ece.ucsb.edu

Source Filter Model of Speech

Before we discuss our methods of deconvolution and speech processing, it is useful to present the source-filter model of speech. This model describes two components of speech: the words, which are the source, and the voice, which is the filter. It allows us to view speech as the output of a convolution: passing the source signal through the filter (convolving the two signals) produces the output signal, speech. Since speech is made up of numerous syllables and sounds, we can view each syllable and sound in a sentence as its own LTI system with its own impulse response. These impulse responses describe the different sounds and voice qualities of a person's speech.
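The source-filter idea can be sketched in a few lines of Python (a minimal sketch using NumPy and SciPy; the sample rate, pitch, and resonator values below are purely illustrative, not values from our project): an impulse train stands in for the glottal source, and a simple one-resonance all-pole filter stands in for the vocal tract.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                # sample rate in Hz (illustrative)
pitch_hz = 100                           # glottal pulse rate (illustrative)

# Source: an impulse train, a crude model of the glottal excitation
source = np.zeros(fs // 10)              # 100 ms of signal
source[:: fs // pitch_hz] = 1.0

# Filter: a single resonance near 700 Hz standing in for the vocal tract
r, f0 = 0.95, 700.0
den = [1.0, -2 * r * np.cos(2 * np.pi * f0 / fs), r ** 2]

# "Speech" is the source convolved with (filtered by) the vocal-tract filter
speech = lfilter([1.0], den, source)
```

Changing the filter changes the "voice quality" of the output while the impulse train (the "words", in the model above) stays the same.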

Cepstral Analysis

One method of speech signal processing is cepstral analysis, a form of homomorphic deconvolution. The cepstral process applies the sequence of operations described in the block diagram and in the equations below.

x[n] = e[n] * h[n]    (speech is the excitation convolved with the filter)

X(ω) = E(ω) · H(ω)    (the DFT turns convolution into multiplication)

log|X(ω)| = log|E(ω)| + log|H(ω)|    (the logarithm turns multiplication into addition)

c[n] = IDFT{ log|DFT{x[n]}| } = c_e[n] + c_h[n]    (the cepstrum is the sum of the two components)

By taking the logarithm of the magnitude of the discrete Fourier transform of a signal, we are able to see the separation of the excitation signal (high quefrency region) and the transfer function (low quefrency region). This works because the logarithm turns the multiplication of the two spectra into a sum, so the two components become additive and can be separated. Filtering out all but the low quefrency region yields the transfer function, and filtering out all but the high quefrency region yields the excitation signal.
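This separation step can be sketched as follows (a minimal sketch assuming NumPy; the random frame and the cutoff of 30 quefrency bins are arbitrary illustrative choices, not values from our project):

```python
import numpy as np

def real_cepstrum(x):
    # c[n] = IDFT{ log |DFT{x[n]}| }; the small offset avoids log(0)
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real

def lifter(c, cutoff, keep_low=True):
    # The cepstrum of a real signal is symmetric, so measure quefrency
    # from both ends and keep (or discard) everything below the cutoff.
    n = np.arange(len(c))
    q = np.minimum(n, len(c) - n)
    low = q < cutoff
    return c * low if keep_low else c * ~low

x = np.random.randn(512)                  # stand-in for a windowed speech frame
c = real_cepstrum(x)
c_filter = lifter(c, 30)                  # low quefrency -> transfer function
c_source = lifter(c, 30, keep_low=False)  # high quefrency -> excitation
# The FFT of the low-quefrency part gives the smoothed log-magnitude envelope
envelope = np.fft.fft(c_filter).real
```

Because the two liftering windows are complementary, the low- and high-quefrency parts add back up to the full cepstrum.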

 

In theory, once we separate the excitation and transfer signals of two audio files, we can convolve the excitation signal of audio file 1 with the transfer function of audio file 2. This should give us a voice conversion of sorts, where we hear the words of audio file 1 spoken by the voice of audio file 2. In our project, we found it very difficult to implement voice conversion with this method. The graphs we obtained from cepstral analysis were informative for some signals, which we discuss in the analysis section. Cepstral analysis is very good at describing the characteristics of sounds and is an extremely valuable tool in speech processing; however, we chose linear predictive coding for our voice conversion process because it was more straightforward to implement.

Linear Predictive Coding

Another form of homomorphic deconvolution is linear predictive coding (LPC). This method differs significantly from the cepstral transform. It can provide extremely accurate estimates of speech parameters by modeling the current speech sample as a linear combination of past speech samples. The equations describing the process are shown below.

ŝ[n] = Σ_{k=1}^{p} a_k s[n−k]    (predicted sample)

e[n] = s[n] − ŝ[n]    (prediction error, i.e. the excitation)

H(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k})    (all-pole transfer function)

The order of the linear predictor is given by p, which indicates how many past samples are used to predict the next one. The coefficients are chosen to minimize the least-squares prediction error. After obtaining them, we can build an all-pole filter whose denominator is formed from those coefficients; its poles lie at the roots of the prediction polynomial, not at the coefficient values themselves. This filter represents the transfer function, or voice qualities. Passing the speech through the corresponding inverse filter yields the excitation, or source, signal. We can then swap the components of two audio signals to get a voice conversion. This is the method we chose to implement for our project.
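The pipeline above can be sketched in Python (a minimal sketch assuming NumPy and SciPy; the synthetic AR(2) signal and order p = 2 are illustrative stand-ins for a real speech frame and a typical speech order of 10–20): estimate the coefficients with the autocorrelation method, inverse-filter to get the excitation (residual), then re-filter the residual through the all-pole filter to reconstruct the frame.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, p):
    # Autocorrelation method: solve the Yule-Walker normal equations
    # R a = r, whose solution minimizes the least-squares prediction error.
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p + 1])   # s[n] ~ sum_k a[k] s[n-k-1]

# Synthetic AR(2) "speech" frame so the example is self-contained
rng = np.random.default_rng(0)
true_denominator = [1.0, -1.3, 0.8]           # illustrative pole pair
x = lfilter([1.0], true_denominator, rng.standard_normal(2048))

a = lpc(x, p=2)
inverse = np.concatenate(([1.0], -a))         # A(z) = 1 - sum_k a_k z^{-k}
residual = lfilter(inverse, [1.0], x)         # excitation (source) estimate
recon = lfilter([1.0], inverse, residual)     # re-synthesis recovers the frame
```

For voice conversion, the residual of one recording would be re-filtered through the all-pole filter estimated from the other.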

