Wednesday, April 26, 2017

Energy Threshold Calibration in Speech Recognition

In my last post on Speech Recognition, I showed how to set up the Python SpeechRecognition package with PyAudio and PocketSphinx to recognize speech with just a few lines of code. As you may remember, we ran into an issue where the speech recognition would just hang there, unable to recognize that we were speaking.

Speech Recognition just hanging there, not recognizing that you're speaking

We found out that this was happening due to ambient noise.

Although we humans can distinguish speech from noise naturally, to a computer program they are both just audio levels. It needs to know which levels should be considered speech (which it then processes to recognize what's being said), and which levels should be considered silence or background noise. So, libraries like SpeechRecognition have an energy threshold that defines the audio level at and above which the input is treated as speech.

Now, this default energy threshold works most of the time. If your environment is sufficiently quiet, the recognizer will pick up your speech without problems. But if your environment is noisy - e.g. an office with many people talking, or machinery running nearby - then the program will have trouble distinguishing speech from noise, which causes the issue we observed.

So, in a situation like that, we should adjust the energy threshold so that speech is properly distinguished from noise. The SpeechRecognition package has a couple of parameters that help you with this.
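To give a rough idea before we dig into the details, here's a minimal sketch of the two usual approaches: setting the threshold manually, or letting the recognizer calibrate itself against the ambient noise. The threshold value of 4000 below is just an arbitrary example; the right value depends on your microphone and environment.

import speech_recognition as sr

r = sr.Recognizer()

# Approach 1: set the energy threshold manually.
# 4000 is an arbitrary example - quiet rooms work with lower values,
# noisy rooms may need a few thousand.
r.energy_threshold = 4000

# Approach 2: sample the ambient noise for a moment and let the
# recognizer pick a suitable threshold automatically.
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=1)  # listens for 1 second
    print("Calibrated energy threshold:", r.energy_threshold)
    audio = r.listen(source)

print(r.recognize_sphinx(audio))

Note that the automatic calibration listens to the first second of audio from the source, so stay silent while it runs.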

Thursday, April 13, 2017

How deep should it be to be called Deep Learning?

If you remember, some time back I wrote an article on What is Deep Learning?, in which I explored the confusion many have about the terms Artificial Intelligence, Machine Learning, and Deep Learning. We talked about how those terms relate to each other: how the drive to build an intelligent machine started the field of Artificial Intelligence; how, when building an intelligence from scratch proved too ambitious, the field evolved into Machine Learning; and how, with the expansion of both computer hardware capabilities and our understanding of the natural brain, the field of Deep Learning dawned.

We learned that the deeper and more complex models of Deep Learning (compared to traditional models) are able to consume massive amounts of data, and are able to learn complex features by Hierarchical Feature Learning through multiple layers of abstraction. We saw that Deep Learning algorithms don’t have the "plateau in performance" of traditional machine learning algorithms: there is no practical limit on the amount of data they can ingest. Simply put, the more data they are given, the better they perform.

The Plateau in Performance in Traditional vs. Deep Learning


Once you grasp the capabilities of Deep Learning, there’s one question that usually comes up when you first learn about it:

If we say that deeper and more complex models are what give Deep Learning the ability to surpass even human capabilities, then how deep does a machine learning model need to be for it to be considered a Deep Learning model?

I had the same question when I was first getting started with Deep Learning, and a few other Deep Learning enthusiasts have asked me the same thing.

It turns out, we were asking the wrong question. We need to look at Deep Learning from a different angle to understand it.

Let’s take a step back and see how a Deep Learning model works.

Tuesday, April 4, 2017

Extracting individual Facial Features from Dlib Face Landmarks

If you remember, in my last post on Dlib, I showed how to get the Face Landmark Detection feature of Dlib working with OpenCV. We saw how to use the pre-trained 68-point facial landmark model that comes with Dlib together with its shape predictor functionality, and then how to convert the output into a NumPy array to use it in an OpenCV context. We were able to get all 68 feature points drawn onto our face image.

Dlib detecting the 68 Face Landmarks

The 68 feature points which the Dlib model detects cover the jawline of the face, the left and right eyes, the left and right eyebrows, the nose, and the mouth. So, what if you only want to detect a few of those features on a face? E.g. you may only want the positions of the eyes and the nose. Is there a way to extract only a few of the features from the Dlib shape predictor?

There is actually a very simple way to do that. Here’s how.
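As a rough sketch of the idea: the 68 points always come back in a fixed order, so each facial feature corresponds to a fixed index range in the landmark array. The ranges below follow the standard annotation order of the 68-point model; the image file name and the drawing code are just placeholders for illustration.

import cv2
import dlib
import numpy as np

# Index ranges of each feature within the 68-point landmark array,
# following the standard annotation order of the model.
FACIAL_LANDMARKS = {
    "jawline": (0, 17),
    "right_eyebrow": (17, 22),
    "left_eyebrow": (22, 27),
    "nose": (27, 36),
    "right_eye": (36, 42),
    "left_eye": (42, 48),
    "mouth": (48, 68),
}

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")  # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    # Convert the full dlib shape into a (68, 2) NumPy array, as before.
    shape = predictor(gray, rect)
    points = np.array([[p.x, p.y] for p in shape.parts()])

    # Pick out only the features we want, e.g. the eyes and the nose.
    for feature in ("left_eye", "right_eye", "nose"):
        start, end = FACIAL_LANDMARKS[feature]
        for (x, y) in points[start:end]:
            cv2.circle(image, (int(x), int(y)), 2, (0, 255, 0), -1)

cv2.imshow("Selected facial features", image)
cv2.waitKey(0)

Since the ordering is fixed, extracting a feature is just a slice of the NumPy array - no extra model or detection pass needed.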