Wednesday, April 26, 2017

Energy Threshold Calibration in Speech Recognition

In my last post on Speech Recognition, I showed how to set up the Python SpeechRecognition package with PyAudio and pocketsphinx to recognize speech with just a few lines of code. And, as you may remember, we ran into an issue where the speech recognition just hangs, unable to recognize that we're speaking.

Speech Recognition just hanging there, not recognizing that you're speaking

We found out that this was happening due to ambient noise.

Although we humans are able to distinguish speech from noise naturally, for a computer program both are just audio levels. It needs to know which levels should be considered speech (which it needs to process in order to recognize what's being said), and which levels should be considered silence or background noise. So, libraries like SpeechRecognition have an energy threshold: audio at or above that level is considered speech, and anything below it is considered silence.

Now, the default energy threshold works most of the time. If your environment is sufficiently quiet, the program will be able to recognize you talking without problems. But if your environment is noisy (e.g. an office with many people talking, or a room with machinery running), the program will have trouble distinguishing speech from noise, which causes the issue we observed.
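To make the idea concrete, here is a minimal sketch of an energy check in plain Python. It does not use the SpeechRecognition package: the rms_energy, is_speech and sine_chunk helpers are hypothetical names made up for this illustration, and the synthetic tones simply stand in for microphone input at different loudness levels.

```python
import math


def rms_energy(samples):
    """Root-mean-square level of a chunk of signed 16-bit samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, energy_threshold=300):
    """Treat a chunk as speech when its RMS energy exceeds the threshold."""
    return rms_energy(samples) > energy_threshold


def sine_chunk(amplitude, n=1024, freq=440.0, rate=16000):
    """Hypothetical helper: synthesize a tone instead of reading a microphone."""
    return [int(amplitude * math.sin(2 * math.pi * freq * i / rate))
            for i in range(n)]


quiet_room = sine_chunk(amplitude=80)     # low-level background hum
noisy_office = sine_chunk(amplitude=600)  # louder background, nobody speaking
talking = sine_chunk(amplitude=3000)      # someone speaking into the mic

for name, chunk in [("quiet room", quiet_room),
                    ("noisy office", noisy_office),
                    ("talking", talking)]:
    print(name, round(rms_energy(chunk)),
          "speech at 300:", is_speech(chunk),
          "speech at 600:", is_speech(chunk, energy_threshold=600))
```

With a threshold of 300, the "noisy office" chunk is classified as speech even though nobody is talking, so the program would keep waiting for the "speech" to end. Raising the threshold to 600 puts the noise back on the silent side while the talking chunk still counts as speech.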

So, in a situation like that, we should adjust the energy threshold so the program can properly distinguish speech from noise. The SpeechRecognition package has a few settings that help you with this.


The most straightforward way would be to set the energy_threshold parameter.
 import speech_recognition as sr  
   
 r = sr.Recognizer()  
 r.energy_threshold = 400  

The energy_threshold value is set to 300 by default. Under 'ideal' conditions (such as in a quiet room), values between 0 and 100 are typically silence or ambient noise, and values from about 150 to 3500 are typically speech. The right value also depends on how sensitive your microphone is.

But we all know that 'ideal' conditions are quite rare in practice, which means we will have to fiddle with this value to get it right.

But how do we find the proper value for the environment we're in?

Well, the SpeechRecognition package has a few other features to help us with that.

The first, and the easiest, is the adjust_for_ambient_noise method, which I used in my earlier post.
With adjust_for_ambient_noise, you ask the program to listen to the ambient noise for some time, and adjust the energy threshold accordingly.
 import speech_recognition as sr   
   
 r = sr.Recognizer()   
 with sr.Microphone() as source:   
     print("Please wait. Calibrating microphone...")   
     # listen for 5 seconds and calculate the ambient noise energy level   
     r.adjust_for_ambient_noise(source, duration=5)

     # rest of your code

Here, we have set it to listen for 5 seconds to get the ambient noise level.

The duration parameter defines the maximum number of seconds the program should listen for in order to adjust the energy threshold. According to the documentation, if it detects speech before this duration ends (i.e. it hears a level significantly higher than the ambient noise), it will stop early and adjust the threshold with what it has. So make sure nobody is speaking while this runs. The default duration is 1 second, and the documentation suggests a duration of at least 0.5 seconds to get a representative sample of the ambient noise.
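To get a feel for why the duration matters, here is a rough sketch of a calibration loop written in plain Python. It is an illustration only, not the library's code: calibrate_threshold is a hypothetical function, and the damping (0.15) and ratio (1.5) values are assumptions chosen to resemble the library's documented dynamic_energy_adjustment_damping and dynamic_energy_ratio settings. The idea is that the threshold is pulled a little closer to "ambient energy times ratio" with every chunk of audio heard.

```python
def calibrate_threshold(chunk_energies, seconds_per_chunk,
                        start_threshold=300.0, damping=0.15, ratio=1.5):
    """Move the threshold toward (ambient energy * ratio), one chunk at a time.

    damping and ratio are assumed values for this sketch.
    """
    threshold = start_threshold
    # Scale the damping to the chunk length so that one full second of audio
    # shrinks the remaining gap to `damping` times its previous size.
    chunk_damping = damping ** seconds_per_chunk
    for energy in chunk_energies:
        target = energy * ratio
        threshold = threshold * chunk_damping + target * (1 - chunk_damping)
    return threshold


seconds_per_chunk = 1024 / 16000           # 64 ms chunks at 16 kHz (assumed)
chunks_in_5s = int(5 / seconds_per_chunk)  # 78 chunks
ambient = [400.0] * chunks_in_5s           # steady background noise at level 400

five_seconds = calibrate_threshold(ambient, seconds_per_chunk)
half_second = calibrate_threshold(ambient[:7], seconds_per_chunk)

print("after ~0.5 s:", round(half_second))
print("after ~5 s:  ", round(five_seconds))
```

In this model, five seconds of steady noise at level 400 brings the threshold to roughly 600 (400 times the 1.5 ratio), while half a second only gets it part of the way there from the default of 300. That is one way to think about why a longer calibration period tends to give a more reliable starting value.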

So, with adjust_for_ambient_noise we can set the energy threshold to a good value at the start of the program. But what if the ambient noise level of your environment keeps changing?

That's where the dynamic_energy_threshold parameter comes in.

With dynamic_energy_threshold set to True (which is also the library's default), the program continuously re-adjusts the energy threshold to match the environment, based on the ambient noise level at that time.
 import speech_recognition as sr  
   
 r = sr.Recognizer()  
 r.energy_threshold = 4000  
 r.dynamic_energy_threshold = True  

Here, we have set the initial energy threshold to a high value and set dynamic_energy_threshold to True. The program will gradually lower the threshold to a value that works in the current environment, and will keep updating it if the ambient noise level changes.
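The gradual lowering can be pictured with a small simulation. Again, this is a sketch under stated assumptions rather than the library's implementation: simulate_dynamic_threshold is a hypothetical function, the damping and ratio values are assumed, and the model assumes the threshold is only updated while the incoming audio is still below it (that is, while the program is waiting for speech to start).

```python
def simulate_dynamic_threshold(chunk_energies, seconds_per_chunk,
                               start_threshold=4000.0, damping=0.15, ratio=1.5):
    """Return (index of the first chunk heard as speech, threshold history).

    The index is None if no chunk ever exceeds the threshold.
    """
    threshold = start_threshold
    history = []
    chunk_damping = damping ** seconds_per_chunk
    for index, energy in enumerate(chunk_energies):
        if energy > threshold:
            # Loud enough to count as speech: stop adjusting and report it.
            return index, history
        # Otherwise treat the chunk as ambient noise and nudge the threshold
        # toward (ambient energy * ratio).
        threshold = threshold * chunk_damping + energy * ratio * (1 - chunk_damping)
        history.append(threshold)
    return None, history


seconds_per_chunk = 1024 / 16000  # 64 ms chunks at 16 kHz (assumed)
# About 2 seconds of background noise at level 200, then speech at level 1500.
energies = [200.0] * 31 + [1500.0] * 10

first_speech, history = simulate_dynamic_threshold(energies, seconds_per_chunk)
print("threshold after first chunk:", round(history[0]))
print("threshold before speech:    ", round(history[-1]))
print("speech detected at chunk:   ", first_speech)
```

Starting from 4000, the simulated threshold drops with every quiet chunk and settles a little above the background level, so the louder chunks that follow are picked up as speech as soon as they arrive.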

So, what's the best way to set up the energy levels?
In my testing, I have found that using both adjust_for_ambient_noise and dynamic_energy_threshold gives the best results in most cases.
 import speech_recognition as sr   
   
 r = sr.Recognizer()   
 with sr.Microphone() as source:   
     print("Please wait. Calibrating microphone...")   
     # listen for 5 seconds and calculate the ambient noise energy level   
     r.adjust_for_ambient_noise(source, duration=5)  
     r.dynamic_energy_threshold = True  
       
     # rest of your code  

You can play around with these parameters and see what works best for the environment you'll be using Speech Recognition in.

Related posts:
Easy Speech Recognition in Python with PyAudio and Pocketsphinx

Related links:
https://github.com/Uberi/speech_recognition/blob/master/reference/library-reference.rst
