We started our project by collecting voice samples. To capture enough variation in the human voice, we had six people, three male and three female, speak a pre-written sentence: “Hello, my name is Speecs. I am a digital voice. What did you do last weekend?” We chose this sentence because we originally planned to synthesize a voice, but it turned out to be a good choice for voice morphing as well, since it covers a wide range of the sounds of the human voice.
Once we had our voice samples, we plotted each person's signal in the time domain and, via the discrete Fourier transform (DFT), in the frequency domain. We quickly realized that although the DFT is a valuable tool in signal processing, we could not discern any useful information from the DFT plot alone. Plots of the original signal in the time domain and frequency domain are shown below this paragraph. This led us to explore techniques more specific to speech processing. The two techniques we used, which are covered mathematically in the theory section, were the cepstral transform and linear predictive coding.
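The time/frequency comparison can be reproduced with a short script. The sketch below is in Python with NumPy rather than our MATLAB code, and it uses an invented 440 Hz test tone at an assumed 8 kHz sampling rate in place of a real recording; it shows how the dominant frequency falls out of the DFT magnitude spectrum:

```python
import numpy as np

np.random.seed(0)

# Hypothetical stand-in for one recording: a 440 Hz tone plus noise at
# an assumed 8 kHz sampling rate (our real inputs were recorded speech).
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)

# Magnitude spectrum via the DFT; with a 1 s window each bin is ~1 Hz wide.
X = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peak_hz = freqs[np.argmax(X)]  # dominant frequency, ~440 Hz
```

For a pure tone this works well; for speech, as we found, the raw magnitude spectrum mixes pitch and vocal-tract information and is much harder to read.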

When implementing the cepstral transform, we had a couple of options in MATLAB. The first was the built-in cceps function, which computes the complex cepstrum of the signal. After using this function and trying to interpret the plots it produced, we determined it was not the best option, because we could not see well-defined peaks where we expected them. We soon discovered that our input signal was too long to work properly with the cceps function, so we recorded another set of sound samples. We had one male and one female each record vowel sounds, and then used these samples to analyze their cepstra. Although the cepstral plots for the vowel sounds were better than those for the original signals, we found that we got better results by implementing the cepstral transform ourselves, as shown below:
aaaBoyCeps = abs(ifft(log(abs(fft(aaaBoy))))); % real cepstrum of the "aaa" vowel sample
The main difference between this implementation and cceps is that this version takes the real logarithm of the magnitude spectrum (the real cepstrum), whereas cceps uses the complex logarithm with phase unwrapping. Plots for the “eee” male sound and “aaa” female sound are shown below. You can see periodic peaks in both plots, which represent the harmonics of the voices for those sounds. When interpreting the cepstral plots, the low-quefrency region represents the vocal-tract transfer function; these are the large peaks at the beginning of the plot. As we move toward higher quefrency, we start to see the source signal, that is, the voice itself. It is easy to interpret in this case because each recording contains only one sound.
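This interpretation can be sanity-checked on a synthetic voiced sound whose pitch is known in advance. The sketch below is in Python with NumPy rather than MATLAB, and the 200 Hz impulse-train "voice" and its parameters are invented for illustration; the cepstral line mirrors the one-line real cepstrum above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "vowel": a 200 Hz glottal-like impulse train shaped by a
# crude decaying vocal-tract impulse response, plus a little noise.
# (Signal and parameters are invented for illustration.)
fs, f0 = 8000, 200
n = np.arange(fs // 2)
source = (n % (fs // f0) == 0).astype(float)   # impulse every 40 samples
tract = 0.9 ** np.arange(64)                   # decaying "vocal tract"
x = np.convolve(source, tract)[:len(n)] + 0.01 * rng.standard_normal(len(n))

# Real cepstrum: the same chain of operations as the MATLAB one-liner.
ceps = np.abs(np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)))

# Skip the low-quefrency (transfer-function) region and look for the
# first pitch peak; it should sit near the period fs / f0 = 40 samples.
pitch_quefrency = 20 + np.argmax(ceps[20:60])
```

The large values near quefrency zero come from the tract filter, while the isolated peak at the pitch period comes from the periodic source, which is exactly the separation described above.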
When we tried using cepstral techniques for voice morphing, we struggled to separate the source and transfer function effectively. As a result, we determined that, given our signals, we could not implement voice morphing well using cepstral techniques. We found them valuable for analyzing short speech segments and understanding what was going on in the voice, but we decided to use linear predictive coding (LPC) for our voice morphing.

LPC was used in the majority of our voice-conversion attempts. After applying a pre-emphasis filter to an audio signal, we obtained its LPC filter coefficients. These coefficients model the vocal-tract filter whose resonances, the formants, characterize a person's voice. We applied LPC to two signals, extracted a source and a filter from each, and swapped the components to try to reproduce each source signal in the other person's voice. The results of this process are presented below.
We relied on a few built-in MATLAB functions, primarily lpc(x,p), where p is the order of the prediction polynomial. Using LPC techniques, we found the filters and excitation signals for two separate voice samples. Ideally, filtering the excitation of signal A through person B's all-pole LPC synthesis filter should output the words person A said in person B's voice. We tried this with several different signal combinations and multiple orders for the LPC polynomial. Few of our trials produced a signal that was significantly different from the original: the outputs generally sounded similar to the original speaker's voice, only slightly lowered in pitch or volume. This was a much smaller change than we expected, not the complete voice switch we had hoped for.
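The swap described above can be sketched end to end. This Python/SciPy version is not our MATLAB code: the two "voices" are synthetic AR signals invented for illustration, the pre-emphasis step is omitted for brevity, and lpc_coeffs reimplements the autocorrelation method that MATLAB's lpc function uses:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC, as computed by MATLAB's lpc(x, order)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # Yule-Walker equations
    return np.concatenate(([1.0], -a))              # A(z) = 1 - sum(a_k z^-k)

# Two hypothetical "voices": AR signals with different filter shapes,
# standing in for our male/female recordings.
rng = np.random.default_rng(0)
a_voice1 = [1.0, -1.3, 0.8]   # assumed vocal-tract polynomial 1
a_voice2 = [1.0, -0.2, 0.5]   # assumed vocal-tract polynomial 2
x1 = lfilter([1.0], a_voice1, rng.standard_normal(8000))
x2 = lfilter([1.0], a_voice2, rng.standard_normal(8000))

order = 2
A1 = lpc_coeffs(x1, order)   # should land close to a_voice1
A2 = lpc_coeffs(x2, order)   # should land close to a_voice2

# Inverse-filter voice 1 to get its excitation, then resynthesize it
# through voice 2's all-pole filter: voice 1's words, voice 2's tract.
e1 = lfilter(A1, [1.0], x1)
morphed = lfilter([1.0], A2, e1)
```

On these synthetic signals the swap is clean because each "voice" really is a fixed all-pole filter; real speech changes its filter from phoneme to phoneme, which is one reason the single-filter swap changed our outputs so little.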
To address this lackluster performance, we broke the signal into smaller frames, performed the LPC conversion on each segment, and recombined the segments afterward. We hoped this would increase the accuracy of the filter by localizing the change it was making. This took some work to sample the correct number of audio values every time and never run past the bounds of the signal. In the process, the tail end of the soundbite was lost, but only a small amount. Our most promising trial mixed a woman's words into a man's voice, i.e. high pitch to low pitch. The resulting signal still retained most of the original female speaker's qualities, only slightly deepened, but it also picked up a significant amount of noise. The speaker was still understandable, but it was not an attractive recreation.
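The segmentation can be made concrete with an illustrative sketch, again in Python/SciPy rather than our MATLAB code; the frame length, LPC order, and white-noise stand-ins for the two recordings are all invented, and lpc_coeffs reimplements the autocorrelation method behind MATLAB's lpc. Like our code, it drops the partial frame at the tail:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC, as computed by MATLAB's lpc(x, order)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

def framewise_convert(src, tgt, frame_len=400, order=10):
    """Per-frame LPC swap: src's excitation through tgt's per-frame filter.
    Trailing samples that do not fill a whole frame are dropped."""
    n_frames = min(len(src), len(tgt)) // frame_len
    out = np.empty(n_frames * frame_len)
    for i in range(n_frames):
        lo, hi = i * frame_len, (i + 1) * frame_len
        A_src = lpc_coeffs(src[lo:hi], order)
        A_tgt = lpc_coeffs(tgt[lo:hi], order)
        excitation = lfilter(A_src, [1.0], src[lo:hi])   # whiten source frame
        out[lo:hi] = lfilter([1.0], A_tgt, excitation)   # target's filter
    return out

rng = np.random.default_rng(1)
src = rng.standard_normal(4321)   # stand-ins for the two recordings
tgt = rng.standard_normal(4321)
y = framewise_convert(src, tgt)   # 10 full 400-sample frames; tail dropped
```

Processing frames independently like this, with no overlap or windowing, also introduces discontinuities at frame boundaries, which is a plausible source of the noise we heard in the recombined signal.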
Download the MATLAB code here