This would be an interesting project to teach yourself machine learning/neural networks, but you may be underestimating its difficulty. You don't want it to turn off the system whenever you say "sheep" or "sleet" or "steep" (or whenever it picks up speech from the TV). "Alexa" was not chosen at random -- that long sibilance in "ksssa" is intentionally distinctive in the upper frequencies. And the wake word exists only to signal the start of a new sampling event -- working on continuous speech is a whole extra problem.
Typically, you put the speech sample through an FFT to get a spectral image, standardise it against the peak volume, and sample it at carefully chosen frequency bands. A neural network can't do much with a raw waveform unless it has been normalised and reduced to a manageable set of features first.
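For concreteness, here is a minimal sketch of that preprocessing step in Python using scipy. It assumes a mono 16 kHz recording in a hypothetical "sample.wav"; the frame sizes and band spacing are illustrative placeholders, not tuned values.

```python
# Hedged sketch of the preprocessing described above, assuming a mono 16 kHz
# recording in "sample.wav" (hypothetical file); parameters are illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("sample.wav")
audio = audio.astype(np.float32)
audio /= np.max(np.abs(audio)) + 1e-9            # standardise against peak volume

# FFT-based spectral image: rows are frequency bins, columns are time frames
freqs, times, spec = spectrogram(audio, fs=rate, nperseg=400, noverlap=240)

# Reduce the full set of FFT bins to a handful of frequency bands: the
# "carefully chosen frequencies" idea (real systems often use mel-spaced bands).
band_edges = np.linspace(100, 8000, 21)          # 20 bands, illustrative choice
features = np.stack([
    spec[(freqs >= lo) & (freqs < hi)].mean(axis=0)
    for lo, hi in zip(band_edges[:-1], band_edges[1:])
])

# Log-compress and normalise so the network sees values in a sensible range.
features = np.log(features + 1e-9)
features = (features - features.mean()) / (features.std() + 1e-9)
print(features.shape)                            # (bands, frames) for the model
```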
You then need many examples of each word or phrase, and also of sounds outside its required training scope: if it does not have a "none of the above" training category, it will always guess at the "nearest" match. Consider how useless some spelling correctors' suggestion lists are, even when they start from clean text input.
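To make the "none of the above" point concrete, here is a hedged sketch of how the model's output might be interpreted. The label set and threshold are illustrative, not values from any particular system.

```python
# Illustrative sketch: even with an explicit reject class in training, it helps
# to refuse low-confidence matches instead of accepting the "nearest" guess.
import numpy as np

LABELS = ["sleep", "wake", "lights", "none_of_the_above"]   # hypothetical label set
REJECT_THRESHOLD = 0.7                                       # illustrative value

def interpret(probabilities: np.ndarray) -> str:
    """Turn a softmax output into a label, or reject it outright."""
    best = int(np.argmax(probabilities))
    if LABELS[best] == "none_of_the_above" or probabilities[best] < REJECT_THRESHOLD:
        return "rejected"
    return LABELS[best]

print(interpret(np.array([0.92, 0.03, 0.03, 0.02])))   # confident "sleep"
print(interpret(np.array([0.40, 0.35, 0.15, 0.10])))   # ambiguous: rejected
```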
It would be far easier to train something on your own voice only. Commercial systems also have to cope with regional accents, differences in voice pitch and timbre (e.g. male vs. female), and background noise (including non-linear microphone response).
Even when you can say "sleep" and have the app come up with something like "cmd 19: 72% match: sleep", you still have to treat that result as a kind of menu entry and issue whatever command or service call actually implements the corresponding action.
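As a rough sketch of that last step, the recogniser's output can be fed into a plain dispatch table. The command names, confidence floor, and send_command() helper below are hypothetical stand-ins for whatever your system actually exposes.

```python
# Hedged sketch of dispatching a recognised phrase to an action.
# COMMANDS and send_command() are hypothetical stand-ins for your own services.
CONFIDENCE_FLOOR = 0.70      # illustrative; tune against your false-trigger rate

def send_command(target: str, verb: str) -> None:
    print(f"sending '{verb}' to {target}")     # stand-in for the real service call

COMMANDS = {
    "sleep": lambda: send_command("system", "suspend"),
    "wake":  lambda: send_command("system", "resume"),
}

def handle_recognition(label: str, confidence: float) -> None:
    action = COMMANDS.get(label)
    if action is None or confidence < CONFIDENCE_FLOOR:
        return                                 # ignore weak or unknown matches
    action()

handle_recognition("sleep", 0.72)   # e.g. "cmd 19: 72% match: sleep" -> suspend
```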