EEL 6586 Speech Recognition

 

Proposal

The purpose of this project is to apply speech recognition to a particular real-world application: a spoken instruction is given through a microphone, and a robot carries out the corresponding action (open or close a door, open or close a window, turn on or switch off the radio, etc.). The robot, its surroundings, and all of its actions are shown by animation. The template contains several sentences, each representing a different order to the robot. Any person can say any one sentence from the template to have the robot perform an action. This kind of system appears in many industrial applications: voice dialing in a car, credit card information checking over the telephone, voice input systems for computers (which are more complicated), etc.

First, the input instruction is filtered to remove noise, the features are extracted, the voiced and unvoiced phonemes are separated, and each word is recognized. Then the whole sentence can be estimated and classified. Since different people pronounce even the same sentence differently, an HMM is used to improve the recognition accuracy.

In this project, different methods are used at each step so that their results can be compared. For example, both energy and zero-crossing rate are used for speech and silence separation, and both LPC and MFCC are used for feature extraction. The results of each method are given and analyzed, and the best one is used in the presentation.
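The speech and silence separation mentioned above can be sketched as follows. This is a minimal Python illustration, not the project's Matlab code; the function names, the frame length, the energy threshold, and the toy signal are all assumptions chosen for the example.

```python
def frame_features(signal, frame_len=4):
    """Return (energy, zero_crossings) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        # A zero crossing is a sign change between neighbouring samples.
        zc = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        feats.append((energy, zc))
    return feats


def speech_frames(signal, frame_len=4, energy_threshold=0.5):
    """Indices of frames whose short-time energy exceeds the threshold."""
    feats = frame_features(signal, frame_len)
    return [i for i, (energy, _) in enumerate(feats) if energy > energy_threshold]


# Toy signal (made up): near-silence, a louder oscillating burst, near-silence.
toy = [0.01, -0.01, 0.02, 0.0,
       0.9, -0.8, 0.7, -0.9,
       0.8, -0.7, 0.9, -0.8,
       0.0, 0.01, -0.02, 0.01]
active = speech_frames(toy)
print(active)
```

On real recordings the threshold would have to be tuned, and the zero-crossing count could be combined with the energy to catch weak unvoiced sounds.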

 

0. Abstract

Speech control systems mainly cope with short input instructions, generally one or two words. This project uses longer ones: short sentences of 3 to 4 words. A sequence of noise reduction steps increases the tolerance to noisy circumstances. The Fisher linear criterion, the HMM, and a cross-correlation method all give good IV recognition results, using LPC, MFCC, and cross-correlation of LPC as features respectively. A further step is taken to deal with the OOV problem, which makes the system seem more "intelligent".

 

1. Introduction 

Speech control is a system combining speech recognition and control, where the former is the critical step for better control. Speech recognition is one of the important areas in digital speech processing. The study of speech recognition is part of the quest for "artificially intelligent" machines that can "hear" and "understand" spoken information. The conventional method for speech recognition is the HMM (Hidden Markov Model). In this technique, the feature vectors of the speech are extracted, and the log likelihood of the utterance is computed under the model of every word in the vocabulary. The word whose model gives the largest log likelihood is the recognition result. In this project, several different methods are used for recognition.

Speech recognition can be divided into two cases according to the instructions given to the learning machine: In Vocabulary (IV) and Out of Vocabulary (OOV). If all the instructions are in the code book, recognition is easier. But, as often happens, people do not know the pre-installed instructions when they use a voice-controlled machine, so Out of Vocabulary techniques are needed to reject such inputs. This project focuses on these two aspects.

Since speech instructions can be given under any circumstances, robustness to noise is important. Part two covers self-adaptive noise reduction using a Wiener filter, as well as preprocessing with pre-emphasis and the 2nd derivative of the signal. Part three concerns In Vocabulary recognition: the Fisher linear criterion, the HMM, and the cross-correlation method. Part four focuses on Out of Vocabulary detection; the method, analysis, and results are given. The last part is the conclusion. The demo is attached in the appendix.

In this project, the vocabulary (training data) is recorded from my own voice. There are 6 instructions in total, each of which has 30 sentences for training and 15 sentences for testing:

Open the door    Close the door    Open the window    Close the window    Turn on the radio    Switch off the radio

Another 50 OOV utterances, which are not in the above categories, are used for the OOV test.

 

2. Noise reduction

The purpose of this process is to make the input speech robust to different circumstances. It takes 3 steps. First, a pre-emphasis filter attenuates the low-frequency part of the voice signal, which often comes from the background during recording. Second, the second derivative of the pre-emphasized signal is taken to remove relatively stationary components and to compensate for some of the losses caused by pre-emphasis. Finally, a Wiener filter is employed to cope with AWGN (additive white Gaussian noise). The noise usually met is AWGN, which is statistically independent and orthogonal. Its instantaneous amplitude, phase, and time delay are unknown. The Wiener filter can be used to reduce it.
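The first two preprocessing steps can be sketched in a few lines. This is a Python illustration under assumptions, not the project's delet_noise.m: the first-order form of the pre-emphasis filter, the coefficient 0.95, and the function names are all chosen for the example.

```python
def pre_emphasis(x, alpha=0.95):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def second_difference(x):
    """Discrete second derivative d[n] = x[n+1] - 2*x[n] + x[n-1]."""
    return [x[n + 1] - 2 * x[n] + x[n - 1] for n in range(1, len(x) - 1)]


# A constant (very low frequency) input is almost removed by pre-emphasis.
dc = [1.0] * 6
emphasized = pre_emphasis(dc)

# A slowly varying linear trend has zero second difference.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
curvature = second_difference(ramp)
print(emphasized, curvature)
```

Both toy inputs show the intended effect: slowly varying background components are strongly attenuated while rapid changes in the speech signal pass through.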

fig. 1

As shown in fig. 1, x(n) = s(n) + N(n), where N(n) is AWGN. A.F. (the adaptive filter) is an FIR filter of order M. Its coefficients are chosen to minimize the expected power of the output e(n). By the relation between the ergodic (time) mean and the statistical mean, the noise samples N(n) are zero mean and uncorrelated over a long observation, so they cancel each other at the output y(n). The associated code is delet_noise.m.

The following figures show the effect of noise reduction on the utterance "We love this game":

fig. 2: original                    fig. 3: after pre-emphasis

fig. 4: after derivative            fig. 5: after Wiener filter


3. Pattern recognition -- In Vocabulary

3.1 Fisher Linear Criterion

I use the LPC coefficients as features, with 10 coefficients for each sentence. The noise is first removed from the signal, and end-point detection (based on energy) is applied to discard the silent parts before and after the speech. The LPC coefficients are calculated on the whole sentence. The coefficients span a 10-D space, and the Fisher linear criterion projects them onto a single line that gives the most separation between different sentences.
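The projection can be illustrated for one pair of classes. The sketch below is Python, not the project's code, and uses made-up 2-D points instead of 10-D LPC vectors so that the 2x2 scatter matrix can be inverted by hand; the function names are assumptions. It computes the usual Fisher direction w = Sw^-1 (ma - mb), where Sw is the within-class scatter matrix and ma, mb are the class means.

```python
def mean_vec(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2x2 scatter matrix of the points around their mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Fisher direction w = Sw^-1 (ma - mb) for two 2-D classes."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(points, w):
    """Project each point onto the line defined by w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Made-up toy data standing in for the feature vectors of two instructions.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.5)]
class_b = [(5.0, 1.0), (6.0, 2.0), (7.0, 2.5), (6.0, 0.5)]
w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
separated = max(proj_b) < min(proj_a) or max(proj_a) < min(proj_b)
print(w, separated)
```

With six instructions, one such direction would be computed for each pair of classes, and a threshold on the projected value decides between the two classes of the pair.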

The accuracy on the training data reaches 97.22%, which means only 5 errors out of 180 sentences. But the accuracy on the testing speech is only 53%. The reason is that, without windowing the signal into properly overlapped frames, too much information is lost, so the test results deviate too much from the training ones. In fact, even the training features contain insufficient information, so the results are too sensitive to any small change in the input. Table 1 shows the results on the training data. The digits 1 to 6 represent the different sentences (instructions). From fig. 5, we can see that this projection separates the training data well, and no rejection happens.

 

table 1. training results (rows: spoken instruction, columns: recognized instruction):

       1    2    3    4    5    6
  1   29    0    0    1    0    0
  2    0   30    0    0    0    0
  3    0    0   26    4    0    0
  4    0    0    0   30    0    0
  5    0    0    0    0   30    0
  6    0    0    0    0    0   30

 

fig. 5 : Linear projection for each pair of classes in training data

 

table 2. testing results (rows: spoken instruction, columns: recognized instruction):

       1    2    3    4    5    6
  1    6    0    1    6    2    0
  2    7    4    0    4    0    0
  3    1    0    1   12    0    1
  4    2    3    0    7    3    0
  5    0    0    0    0   15    0
  6    0    0    0    0    0   15

 

3.2 HMM

The process in this part is similar to hw4, and the code is mainly taken from that provided in hw4. Since the instructions are all short sentences, I use each whole sentence as the input to the HMM, without separating it into smaller parts. The accuracy is 100% on the training data and 100% on the testing data. Noise reduction is used, but there is no end-point detection; the silent parts are treated as part of the instruction. For each sentence, 30 states are adopted. The results and some interpretation are shown in appendix 1.
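The decision rule (score the utterance under every sentence model and keep the model with the largest log likelihood) can be sketched with the forward algorithm. The Python code below is only an illustration and is not the hw4 code: it assumes discrete observation symbols and tiny 2-state models with made-up probabilities, whereas the project uses 30 states per sentence and real feature vectors.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete-observation HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n_states))
                     * emit[s][obs[t]] for s in range(n_states)]
        scale = sum(alpha)              # rescale to avoid numerical underflow
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def recognize(obs, models):
    """Return the model name with the largest log likelihood, plus all scores."""
    scores = {name: log_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical left-to-right models: (start, transition, emission) probabilities.
models = {
    "open": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "close": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
best, scores = recognize([0, 0, 1, 1], models)
print(best, scores)
```

The per-model scores returned here are the same quantities plotted as data1 to data6 in appendix 1, one curve per sentence model.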

 

3.3 Cross-correlation method

This method is my own, developed with reference to [2] but quite different from it. HMM and DTW are complicated and time consuming. This new attempt simplifies the algorithm and shortens the processing time.

In speech recognition, the major work is to find an appropriate way to extract features that characterize a spoken word. If the features of a certain word are unique, the word is easy to retrieve. Linear predictive coding (LPC) provides one such kind of feature and is also one of the most popular coding techniques for speech signals. LPC can operate at a low bit rate with relatively modest computational resources while providing a very usable coded representation of the original speech signal.

In [2], the authors use the autocorrelation of LPC as the feature and assume that the boundary of each class is not hard and clear but vague. They therefore apply a fuzzy logic method to calculate the membership grade of each class, defining the membership function as a triangle in which the mean training feature value takes membership grade 1 and the minimum and maximum training values take membership grade 0.2. There are two problems with this. First, [2] treats only single-word recognition, which does not fit my case. Second, the separation of different classes relies mainly on the differences among them, which prove in this project to be tiny and to cause heavy overlap. So I use the CROSS-correlation of LPCs as the feature, and the fuzzy method is dropped because a poorly estimated membership function may lead to worse results.

In this project, to get more accurate results, the noise reduction technique of Section 2 is applied first; then the end points are calculated, and only the features in the speech part are considered. The LPC coefficients are extracted from the speech signal with overlapping windows, 10 LPC coefficients per window. Picking one of the 10 LPC coefficients from every window gives a feature sequence covering the whole speech period. For example, if the speech is divided into 190 windows and each window has 10 LPC coefficients, the ith LPC coefficient of each window can be picked, giving 190 coefficients in total that represent the speech over its whole duration. In this way, I can find the best features to represent each sentence.
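The per-window LPC extraction and the picking of one coefficient per window can be sketched as below. This is a Python illustration, not the project's Matlab code: it assumes the autocorrelation method with the Levinson-Durbin recursion, rectangular (untapered) windows, and made-up frame sizes and test signal; all function names are hypothetical.

```python
import math


def autocorr(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion. Returns a[1..order] such that
    x[n] is approximated by sum over k of a[k] * x[n-k]."""
    r = autocorr(frame, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                  # silent frame: nothing left to predict
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


def coefficient_track(signal, order, frame_len, hop, index):
    """Pick the index-th LPC coefficient (1-based) from every overlapping frame."""
    return [lpc(signal[start:start + frame_len], order)[index - 1]
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Made-up test signal: two sinusoids, 200 samples, 50% overlapping frames.
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(200)]
track = coefficient_track(signal, order=4, frame_len=40, hop=20, index=3)
print(len(track), track[:3])
```

The resulting track plays the role of the 190-value sequence in the example above: one value per window, following a single LPC coefficient through the utterance.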

The 3rd and 4th LPCs of two sentences from each of the utterance groups "open the door" and "close the window" are shown in fig. 6. The LPCs are similar within the same utterance group. The 3rd and 4th LPCs of one sentence from each of the six utterance groups are shown in fig. 7. The LPCs are very different for sentences in different utterance groups.

fig.6: the 3rd and 4th LPCs of two "open the door"(left) and two "close the window" (right) 


fig 7: the 3rd and 4th LPCs of one sentence in each of the six utterances




In the training data, there are 30 sentences for each utterance group. Within the same group, the features of the 30 sentences are similar, which means they are correlated with each other. If we take the LPCs of one sentence and calculate its cross-correlations with other sentences of the same group and with sentences of different groups, we find that the cross-correlations within the same group are very different from those across groups. The auto-correlation of one utterance is shown in fig. 8. The cross-correlation within one group (sentences of the same utterance) is shown in fig. 9. The cross-correlation between different groups is shown in fig. 10. Fig. 9 is clearly similar to fig. 8, because sentences in the same group are more similar to each other.


fig.8: the Auto-correlation of the same utterance

 

fig.9: the cross-correlation within one group

 

fig.10: the cross-correlation between different groups

 

Recognition is based on the difference between the two largest peak values of the cross-correlation. For a testing sentence, its cross-correlation with each of the 30 training sentences of an utterance group is calculated, giving 30 difference values per group. The group with the largest sum of differences is the recognition result. Table 3 gives the results on the training data and Table 4 the results on the testing data.
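The decision rule can be sketched as follows. This Python code is only one possible reading of the description, not the project's implementation: it assumes that a "peak" is a local maximum of the full cross-correlation sequence, that a sequence with a single peak contributes the height of that peak, and that no normalization is applied. The function names and toy tracks are made up.

```python
def cross_correlation(a, b):
    """Full cross-correlation of two sequences over all lags."""
    out = []
    for lag in range(-(len(b) - 1), len(a)):
        total = 0.0
        for i, bv in enumerate(b):
            j = i + lag
            if 0 <= j < len(a):
                total += a[j] * bv
        out.append(total)
    return out


def peak_difference(corr):
    """Difference between the two largest local maxima of corr."""
    peaks = [corr[i] for i in range(1, len(corr) - 1)
             if corr[i] > corr[i - 1] and corr[i] >= corr[i + 1]]
    peaks.sort(reverse=True)
    if len(peaks) < 2:                  # assumption: a lone peak counts in full
        return peaks[0] if peaks else 0.0
    return peaks[0] - peaks[1]


def recognize(test_track, templates):
    """Sum the peak differences over each group's templates; largest sum wins."""
    scores = {label: sum(peak_difference(cross_correlation(test_track, t))
                         for t in tracks)
              for label, tracks in templates.items()}
    return max(scores, key=scores.get), scores


# Toy feature tracks: group "x" matches the test track, group "y" is flat.
templates = {"x": [[0.0, 0.0, 1.0, 0.0, 0.0]], "y": [[0.2] * 5]}
label, scores = recognize([0.0, 0.0, 1.0, 0.0, 0.0], templates)
print(label, scores)
```

In the project each group would hold 30 training tracks instead of one, so each score is a sum of 30 peak differences.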

Table 3: training results (rows: spoken instruction, columns: recognized instruction)

       1    2    3    4    5    6
  1   29    0    1    0    0    0
  2    0   30    0    0    0    0
  3    0    0   28    2    0    0
  4    1    4    3   21    1    0
  5    1    0    1    0   28    0
  6    0    0    0    0    0   30

The accuracy on the training speech is 92.2%. The training set size is 6*30.

Table 4: testing results (rows: spoken instruction, columns: recognized instruction)

       1    2    3    4    5    6
  1   12    0    0    0    3    0
  2    1   13    0    0    1    0
  3    0    0   15    0    0    0
  4    1    3    1    9    0    1
  5    0    0    1    0   14    0
  6    2    0    0    0    2   11

The accuracy on the testing speech is 82.2%. The testing set size is 6*15.


4. OOV

A user not familiar with the system may utter out-of-vocabulary words. Any sound passing through the system produces an output score and causes the machine to make a judgment, so the controlled machine would perform some unintended action whenever it receives a "voice", even though that voice often has no connection to the control. So we need a score that measures the confidence of a recognized word.

The basic idea for separating in-vocabulary words from out-of-vocabulary words is that the likelihood difference between the best and 2nd best results is larger for IV words than for OOV words, because no model matches an OOV input well. In this project, I use the HMM as the model. According to [1], the log likelihood of the recognized word itself is not an appropriate measure, because of the difficulty of setting the threshold. In [1], the standard Log Likelihood Ratio (LLR) and an augmented LLR are used. The standard LLR is calculated as follows:

nLLR = (1/N) [log P(best|O) - log P(2nd best|O)]

Here, N is the length of the input utterance, log P(best|O) is the largest log likelihood, and log P(2nd best|O) is the second largest log likelihood.
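As a small numerical illustration of the normalized LLR, the Python sketch below takes the log likelihoods of one utterance under all models and the number of frames N. The function name and the example values are made up.

```python
def normalized_llr(log_likelihoods, n_frames):
    """(best - second best) log likelihood, divided by the utterance length."""
    ranked = sorted(log_likelihoods, reverse=True)
    return (ranked[0] - ranked[1]) / n_frames


# Hypothetical scores of one utterance under three sentence models, N = 170.
llr = normalized_llr([-820.0, -905.0, -1010.0], n_frames=170)
print(llr)
```

Dividing by N keeps long and short utterances comparable, since the raw log likelihood grows in magnitude with the number of frames.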

Additional information can be added to this standard LLR to improve its reliability. Because of the relatively large likelihood difference between the best and 2nd best results, the recognition result for an IV word does not change much if the input utterance is changed a little. But for an OOV word, it is highly probable that the result for the changed input differs from that for the original one. For this reason, a perturbed version of the input can be employed to improve the robustness of the confidence score. Several methods can be applied for perturbing the input feature vector. For example:

c1 = k1*c;

c2 = c - k2*mc;

c3 = c -  k3*Oc;

Here, c is the feature vector, mc is the mean vector of the feature vectors of the input speech, and Oc is the standard deviation vector of the feature vectors of the input speech. k1, k2, and k3 are constants that should be adjusted so that the percentage of discrepancy between the recognition results from the original and perturbed input feature vectors remains below 10% for IV words.
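The three perturbations can be written out directly. In this Python sketch the function name, the constants, and the toy feature frames are assumptions; the mean and standard deviation are taken per feature dimension over all frames of the utterance, and the population standard deviation is used.

```python
import statistics


def perturb(frames, k1, k2, k3):
    """Return the three perturbed versions c1, c2, c3 of a list of feature frames."""
    dims = list(zip(*frames))                      # one tuple per feature dimension
    mc = [statistics.mean(d) for d in dims]        # mean vector
    oc = [statistics.pstdev(d) for d in dims]      # standard deviation vector
    c1 = [[k1 * v for v in f] for f in frames]
    c2 = [[v - k2 * m for v, m in zip(f, mc)] for f in frames]
    c3 = [[v - k3 * s for v, s in zip(f, oc)] for f in frames]
    return c1, c2, c3


# Two made-up 2-D feature frames; the constants are illustrative, not tuned.
frames = [[1.0, 2.0], [3.0, 6.0]]
c1, c2, c3 = perturb(frames, k1=2.0, k2=0.5, k3=0.5)
print(c1, c2, c3)
```

In practice k1, k2, and k3 would be much closer to a no-op than in this toy call, so that IV recognition results change in fewer than 10% of cases.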

After perturbation by c2, if the recognition result is unchanged, a certain value K is added to the LLR. The method is as follows:

LLR_A = LLR + K    if Wo = Wp

LLR_A = LLR        if Wo != Wp

Here, Wo is the word recognized from the original input feature vector, and Wp is the word recognized from the perturbed input feature vector. They can be IV or OOV. A threshold for LLR_A is set using training IV sentences and OOV sentences. When a testing utterance is input, its LLR_A is calculated. If its LLR_A is above the threshold, it is considered an IV sentence; if below, an OOV sentence.
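The augmented score and the threshold test amount to a few lines. The Python sketch below uses hypothetical function names and made-up values for the LLR, the bonus K, and the threshold.

```python
def augmented_llr(llr, word_original, word_perturbed, bonus):
    """Add the bonus K when the original and perturbed inputs give the same word."""
    return llr + bonus if word_original == word_perturbed else llr


def is_in_vocabulary(llr_a, threshold):
    """Accept the utterance as IV when the augmented LLR exceeds the threshold."""
    return llr_a > threshold


# Same LLR, but only the stable recognition result earns the bonus.
stable = augmented_llr(0.25, "open the door", "open the door", bonus=0.5)
unstable = augmented_llr(0.25, "open the door", "close the window", bonus=0.5)
print(is_in_vocabulary(stable, 0.6), is_in_vocabulary(unstable, 0.6))
```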

When I tried the augmented LLR method, I found that it did not give good results, because it was presented and used in [1] for single words, while in my case the inputs are all short sentences. The standard LLR does not help either, because its values for IV and OOV inputs are almost the same. So I made an alternative to the LLR: I use the ratio between the log likelihood difference and the largest log likelihood value as the feature:

ratio_diff = [log P(best|O) - log P(2nd best|O)] / log P(best|O)

The ratio_diff of almost all IV sentences is bigger than 0.2, while that of most OOV sentences is less than 0.2. The following figures show the overall distribution of ratio_diff for IV and OOV sentences, and the detailed position of the classifier:
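A minimal Python sketch of this classifier follows. The log likelihoods in this project are negative (they lie around -1000), so dividing by the largest one directly would give a negative value; the sketch therefore assumes that the magnitude of the largest log likelihood is used, which makes IV values positive as reported. The function names and example scores are made up.

```python
def ratio_diff(log_likelihoods):
    """(best - second best) / |best| for the log likelihoods of one utterance."""
    ranked = sorted(log_likelihoods, reverse=True)
    return (ranked[0] - ranked[1]) / abs(ranked[0])


def classify(log_likelihoods, threshold=0.2):
    """Label an utterance IV when its ratio_diff exceeds the threshold."""
    return "IV" if ratio_diff(log_likelihoods) > threshold else "OOV"


# Hypothetical scores: one clear winner (IV) versus near-equal scores (OOV).
iv_example = [-800.0, -1000.0, -1100.0]
oov_example = [-1500.0, -1550.0, -1600.0]
print(ratio_diff(iv_example), classify(iv_example), classify(oov_example))
```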

fig. 11: overall distribution of ratio_diff for IV and OOV sentences

fig. 12: detailed position of the classifier

In the above figures, a blue star represents an IV sentence and a red circle an OOV one. The following figure shows the relationship between the false alarm rate and the detection accuracy rate. Whenever an OOV sentence is judged as IV, there is a false alarm; whenever an IV sentence is judged as IV, there is a correct IV-OOV detection.
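The trade-off between the two rates can be traced by sweeping the threshold over the confidence scores. The Python sketch below uses made-up scores and hypothetical function names; it only illustrates the two definitions given above.

```python
def rates(iv_scores, oov_scores, threshold):
    """Return (false alarm rate, detection rate) for one threshold."""
    detection = sum(s > threshold for s in iv_scores) / len(iv_scores)
    false_alarm = sum(s > threshold for s in oov_scores) / len(oov_scores)
    return false_alarm, detection


def sweep(iv_scores, oov_scores, thresholds):
    """One (threshold, false alarm rate, detection rate) triple per threshold."""
    return [(t, *rates(iv_scores, oov_scores, t)) for t in thresholds]


# Made-up confidence scores for four IV and four OOV utterances.
iv_scores = [0.25, 0.30, 0.22, 0.18]
oov_scores = [0.05, 0.10, 0.21, 0.15]
curve = sweep(iv_scores, oov_scores, [0.1, 0.2, 0.3])
print(curve)
```

Lowering the threshold raises both rates and raising it lowers both, so each threshold gives one point on the curve of detection rate against false alarm rate.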

fig. 13: detection accuracy rate versus false alarm rate

From fig. 13 we can see that this method accepts more than 95% of in-vocabulary sentences with an out-of-vocabulary error rate below 5%. This is better than the result in [1]. Besides its effectiveness in OOV detection, this method is simpler than the augmented LLR, which needs the perturbed input to be evaluated. In fact, if we look at the log likelihood value directly (which is not recommended in [1]), we get an even better result with an easier calculation. Fig. 14 shows the clear separation in terms of log likelihood:

fig. 14: log likelihood of IV and OOV sentences


If we take -1000 as the threshold, the false alarm rate decreases to zero, and no in-vocabulary instruction is rejected by the controller.


5. Conclusion

In this project, different methods are used for in-vocabulary recognition, each of which uses noise reduction and end-point detection as preprocessing (the HMM does not need end-point detection). LPC and MFCC are used as features. The Fisher linear criterion shows excellent separation in training. Cross-correlation of LPC is a more robust feature than LPC itself, and it reflects the differences among classes well. The HMM gives the best results in both IV recognition and OOV detection, and the ratio of the difference between the 1st and 2nd largest log likelihoods is powerful in separating OOV from IV.


Appendix 1 : HMM results in training and testing

1. training results: 100%

In each figure, data1 represents the log likelihood of the current data passing through model 1, data2 represents the log likelihood of the current data passing through model 2, and so on, where the correspondence is:

data1 : hmm model of Open the door
data2 : hmm model of Close the door
data3 : hmm model of Open the window
data4 : hmm model of Close the window
data5 : hmm model of Turn on the radio
data6 : hmm model of Switch off the radio

The x axis represents the index of the data in each utterance group, i.e. 30 in each training group and 15 in each testing group. Each figure represents one utterance group. So the top line in the first figure shows that all the training data from the group "Open the door" give the largest log likelihood when passing through model 1, the hmm model of Open the door, which means 100% correct for this group.

fig.: Open the door

fig.: Close the door

fig.: Open the window

fig.: Close the window

fig.: Turn on the radio

fig.: Switch off the radio

2. testing results : 100%


fig.: Open the door

fig.: Close the door

fig.: Open the window

fig.: Close the window

fig.: Turn on the radio

fig.: Switch off the radio
 


Appendix 2 : Demo of speech control

The demo code is a .zip file which can be downloaded here. After the .zip file is extracted and recordgui.m is run in Matlab, an interface is shown. If you push the "record" button and speak a sentence into the microphone, a robot appears and performs the command you gave. If the sentence you speak is an OOV sentence, an image is shown telling you "out of vocabulary". If you push the "auto-demo" button, some sentences are picked at random from a sentence library that includes both IV and OOV sentences, and the corresponding commands are recognized and performed, or an OOV warning is shown.

Download the Demo program



Appendix 3 : Code used in this project


References:

[1] Yongwon Jeong, Hyung Soon Kim: "Recognition confidence scoring using recognition results from perturbed input feature vectors", Electronics Letters, vol. 37, no. 18, 30 Aug. 2001, pp. 1143-1145

[2] Yong Qian Ying, Peng-Yung Woo: "Speech Recognition Using Fuzzy Logic", International Joint Conference on Neural Networks (IJCNN '99), 1999, vol. 5