This would be an interesting project to teach yourself machine learning/neural networks, but you may be underestimating its difficulty. You don't want it to turn off the system whenever you say "sheep" or "sleet" or "steep" (or whenever it picks up speech from the TV). "Alexa" was not chosen at random -- that long sibilance in "ksssa" is intentionally distinctive in the upper frequencies. And the wake word exists only to signal the start of a new sampling event -- working on continuous speech is a whole extra problem.
Typically, you put the speech sample through an FFT to get a spectral image, standardise it against the peak volume, and sample it at carefully chosen frequency bands. A neural network can't do much with a raw waveform unless it has been normalised and reduced to a manageable set of features first.
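For concreteness, here is a minimal sketch of that preprocessing step in Python using scipy. It assumes a mono 16 kHz recording in a hypothetical "sample.wav"; the frame sizes and band spacing are illustrative placeholders, not tuned values.

```python
# Hedged sketch of the preprocessing described above, assuming a mono 16 kHz
# recording in "sample.wav" (hypothetical file); parameters are illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("sample.wav")
audio = audio.astype(np.float32)
audio /= np.max(np.abs(audio)) + 1e-9            # standardise against peak volume

# FFT-based spectral image: rows are frequency bins, columns are time frames
freqs, times, spec = spectrogram(audio, fs=rate, nperseg=400, noverlap=240)

# Reduce the full set of FFT bins to a handful of frequency bands: the
# "carefully chosen frequencies" idea (real systems often use mel-spaced bands).
band_edges = np.linspace(100, 8000, 21)          # 20 bands, illustrative choice
features = np.stack([
    spec[(freqs >= lo) & (freqs < hi)].mean(axis=0)
    for lo, hi in zip(band_edges[:-1], band_edges[1:])
])

# Log-compress and normalise so the network sees values in a sensible range.
features = np.log(features + 1e-9)
features = (features - features.mean()) / (features.std() + 1e-9)
print(features.shape)                            # (bands, frames) for the model
```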
You then need many examples of each word or phrase, and also of sounds outside its required training scope: if it does not have a "none of the above" training category, it will always guess at the "nearest" match. Consider how useless some spelling correctors' suggestion lists are, even when they start from clean text input.
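To make the "none of the above" point concrete, here is a hedged sketch of how the model's output might be interpreted. The label set and threshold are illustrative, not values from any particular system.

```python
# Illustrative sketch: even with an explicit reject class in training, it helps
# to refuse low-confidence matches instead of accepting the "nearest" guess.
import numpy as np

LABELS = ["sleep", "wake", "lights", "none_of_the_above"]   # hypothetical label set
REJECT_THRESHOLD = 0.7                                       # illustrative value

def interpret(probabilities: np.ndarray) -> str:
    """Turn a softmax output into a label, or reject it outright."""
    best = int(np.argmax(probabilities))
    if LABELS[best] == "none_of_the_above" or probabilities[best] < REJECT_THRESHOLD:
        return "rejected"
    return LABELS[best]

print(interpret(np.array([0.92, 0.03, 0.03, 0.02])))   # confident "sleep"
print(interpret(np.array([0.40, 0.35, 0.15, 0.10])))   # ambiguous: rejected
```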
It would be far easier to train something on your own voice only. Commercial systems also have to cope with regional accents, differences in voice pitch and timbre (e.g. male vs. female), and background noise (including non-linear microphone response).
Even when you can say "sleep" and have the app come up with something like "cmd 19: 72% match: sleep", you still have to treat that result as a kind of menu entry and issue whatever command or service call actually implements the corresponding action.
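As a rough sketch of that last step, the recogniser's output can be fed into a plain dispatch table. The command names, confidence floor, and send_command() helper below are hypothetical stand-ins for whatever your system actually exposes.

```python
# Hedged sketch of dispatching a recognised phrase to an action.
# COMMANDS and send_command() are hypothetical stand-ins for your own services.
CONFIDENCE_FLOOR = 0.70      # illustrative; tune against your false-trigger rate

def send_command(target: str, verb: str) -> None:
    print(f"sending '{verb}' to {target}")     # stand-in for the real service call

COMMANDS = {
    "sleep": lambda: send_command("system", "suspend"),
    "wake":  lambda: send_command("system", "resume"),
}

def handle_recognition(label: str, confidence: float) -> None:
    action = COMMANDS.get(label)
    if action is None or confidence < CONFIDENCE_FLOOR:
        return                                 # ignore weak or unknown matches
    action()

handle_recognition("sleep", 0.72)   # e.g. "cmd 19: 72% match: sleep" -> suspend
```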