
The short version of the question: I am looking for speech recognition software that runs on Linux and has decent accuracy and usability. Any license and price are fine. It should not be restricted to voice commands, as I want to be able to dictate text.


More details:

I have tried the following, with unsatisfying results:

All the above-mentioned native Linux solutions have both poor accuracy and usability (or some don't allow free-text dictation but only voice commands). By poor accuracy, I mean accuracy significantly below that of the speech recognition software I mention below for other platforms. As for Wine + Dragon NaturallySpeaking, in my experience it keeps crashing, and unfortunately I don't seem to be the only one with such issues.

On Microsoft Windows I use Dragon NaturallySpeaking, on Apple Mac OS X I use Apple Dictation and DragonDictate, on Android I use Google speech recognition, and on iOS I use the built-in Apple speech recognition.

Baidu Research yesterday released the code for its speech recognition library, which uses Connectionist Temporal Classification implemented with Torch. Benchmarks from Gigaom are encouraging, as shown in the table below, but I am not aware of any good wrapper that would make it usable without a fair amount of coding (and a large training data set):

System           Clean (94)   Noisy (82)   Combined (176)
Apple Dictation       14.24        43.76            26.73
Bing Speech           11.73        36.12            22.05
Google API             6.64        30.47            16.72
wit.ai                 7.94        35.06            19.41
Deep Speech            6.56        19.06            11.85

Table 4: Results (%WER) for 5 systems evaluated on the original audio. All systems are scored only on the utterances with predictions given by all systems. The number in parentheses next to each dataset, e.g. Clean (94), is the number of utterances scored.

There exist some very alpha open-source projects:

I am also aware of this attempt at tracking the state of the art and recent results (bibliography) on speech recognition, as well as this benchmark of existing speech recognition APIs.


I am aware of Aenea, which allows speech recognition via Dragonfly on one computer to send events to another, but it has some latency cost.


I am also aware of these two talks exploring Linux options for speech recognition:

  • 2
    Some detail about what you found "unsatisfying" might advance your otherwise interesting but rather general posting topic. For example: what specifically did you find unsatisfying about the "Wine + Dragon NaturallySpeaking" combination? (how did it fail to replicate your Windows experience?) – Theophrastus Jan 18 '16 at 18:20
  • 1
    @Theophrastus Basically all native Linux solutions have both poor accuracy and usability. By poor accuracy, I mean an accuracy significantly below the one the speech recognition software I mentioned for other platforms have. As for Wine + Dragon NaturallySpeaking, in my experience it keeps crashing, and I don't seem to be the only one to have such issues unfortunately (https://appdb.winehq.org/objectManager.php?sClass=application&iId=2077) – Franck Dernoncourt Jan 18 '16 at 18:24
  • 1
    I haven't tried these, but in case someone finds it useful: https://github.com/Uberi/speech_recognition and https://jasperproject.github.io/ and https://github.com/benoitfragit/google2ubuntu – Hatshepsut Jan 06 '17 at 18:18
  • Is there one of these software that has a command-line tool? It would be very interesting to combine speech recognition to a keypress and mousemove tool like xdotool (https://github.com/jordansissel/xdotool) or xsendkey (https://github.com/kyoto/sendkeys). – baptx Mar 05 '19 at 14:15
  • 1
    @baptx, https://github.com/MycroftAI/mycroft-core/issues/2600 – alchemy Jun 07 '20 at 17:06
  • Related: https://askubuntu.com/questions/161515/speech-recognition-app-to-convert-mp3-to-text – Ciro Santilli OurBigBook.com Oct 07 '20 at 16:49

13 Answers

30

vosk-api

https://github.com/alphacep/vosk-api/

It supports 20+ languages.

First you convert the file to the required format, and then you recognize it:

ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav

Then install vosk-api with pip:

pip3 install vosk

Then use these steps:

git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav  > result.json

The result is stored in JSON format.
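The JSON result can be post-processed with a few lines of standard-library Python. A minimal sketch, assuming only that each result line carries a `text` field (which is what vosk's final results use; the exact shape of the example script's output may include more fields, such as per-word timings):

```python
import json

def text_from_results(json_lines):
    """Join the `text` fields of a sequence of vosk result JSON lines."""
    texts = []
    for line in json_lines:
        obj = json.loads(line)
        if obj.get("text"):  # skip empty/partial results
            texts.append(obj["text"])
    return " ".join(texts)

# Hand-made sample in the shape of vosk final results:
sample = ['{"text": "one zero zero zero one"}', '{"text": "nine oh two one oh"}']
print(text_from_results(sample))  # one zero zero zero one nine oh two one oh
```

The same function can be pointed at `result.json` from the command above, one JSON object per line.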

The same directory also contains an SRT subtitle output example, which is more human-readable and can be directly useful to people with that use case:

python3 -m pip install srt
python3 ./test_srt.py test.wav

The sections below show some testing I did with it.

test.wav case study

The test.wav example provided in the repository contains, in a perfect American English accent and with perfect sound quality, three sentences, which I transcribe as:

one zero zero zero one
nine oh two one oh
zero one eight zero three

The "nine oh two one oh" is said very fast, but is still clear. The "z" of the second-to-last "zero" sounds a bit like an "s".

The SRT generated above reads:

1
00:00:00,870 --> 00:00:02,610
what zero zero zero one

2
00:00:03,930 --> 00:00:04,950
no no to uno

3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three

so we can see that several mistakes were made, presumably in part because we, unlike the model, can rely on the knowledge that all the words are numbers.

Next I also tried the vosk-model-en-us-aspire-0.2 model, which is a 1.4 GB download compared to the 36 MB of vosk-model-small-en-us-0.3, and is listed at https://alphacephei.com/vosk/models:

mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model

and the result was:

1
00:00:00,840 --> 00:00:02,610
one zero zero zero one

2
00:00:04,026 --> 00:00:04,980
i know what you window

3
00:00:06,270 --> 00:00:07,980
serial one eight zero three

which got one more word correct.

IBM "Think" Speech case study

Now let's have some fun, shall we? From https://en.wikipedia.org/wiki/Think_(IBM) (public domain in the USA):

wget https://upload.wikimedia.org/wikipedia/commons/4/49/Think_Thomas_J_Watson_Sr.ogg
ffmpeg -i Think_Thomas_J_Watson_Sr.ogg -ar 16000 -ac 1 think.wav
time python3 ./test_srt.py think.wav > think.srt

The sound quality is not great, with a lot of microphone hiss due to the technology of the time. The speech is however very clear and deliberately paced. The recording is 28 seconds long, and the wav file is 900 KB.
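As a sanity check, the file size matches the format we converted to: a 16 kHz, mono, 16-bit WAV stores a fixed number of bytes per second of audio.

```python
# 16000 samples/s * 2 bytes per 16-bit sample = 32000 bytes per second
bytes_per_second = 16000 * 2
payload = bytes_per_second * 28  # the 28-second recording
print(payload)  # 896000 bytes, i.e. about 900 KB, matching the size above
```

(The WAV header adds a few dozen more bytes on top of the raw samples.)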

Conversion took 32 seconds. Sample output for the first three sentences:

1
00:00:00,299 --> 00:00:01,650
and we must study

2
00:00:02,761 --> 00:00:05,549
reading listening name scott

3
00:00:06,300 --> 00:00:08,820
observing and thank you

and the Wikipedia transcription for the same segment reads:

1
00:00:00,518 --> 00:00:02,513
And we must study

2
00:00:02,613 --> 00:00:08,492
through reading, listening, discussing, observing, and thinking.

"We choose to go to the Moon" case study

https://en.wikipedia.org/wiki/We_choose_to_go_to_the_Moon (public domain)

OK, one more fun one. This audio has good sound quality, with occasional approving cheers from the crowd and a slight echo from the venue:

wget -O moon.ogv https://upload.wikimedia.org/wikipedia/commons/1/16/President_Kennedy%27s_Speech_at_Rice_University.ogv
ffmpeg -i moon.ogv -ss 09:12 -to 09:29 -q:a 0 -map a -ar 16000 -ac 1 moon.wav
time python3 ./test_srt.py moon.wav > moon.srt

Audio duration: 17s, wav file size 532K, conversion time 22s, output:

1
00:00:01,410 --> 00:00:16,800
we choose to go to the moon in this decade and do the other things not because they are easy but because they are hard because that goal will serve to organize and measure the best of our energies and skills

and the corresponding Wikipedia captions:

89
00:09:06,310 --> 00:09:18,900
We choose to go to the moon in this decade and do the other things,

90
00:09:18,900 --> 00:09:22,550
not because they are easy, but because they are hard,

91
00:09:22,550 --> 00:09:30,000
because that goal will serve to organize and measure the best of our energies and skills,

Perfect except for a missing "the" and punctuation!
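The %WER figures quoted in the question's benchmark table can be reproduced for informal comparisons like this one: word error rate is just the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (no normalization of punctuation or case, which real scoring pipelines do apply):

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance between word lists over reference length."""
    r, h = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)

ref = "we choose to go to the moon in this decade"
hyp = "we choose to go to moon in this decade"  # the missing "the"
print(wer(ref, hyp))  # 0.1: one deletion out of ten reference words
```

So the missing "the" above corresponds to a 10% WER on that ten-word stretch.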

Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3, Ubuntu 20.04, Lenovo ThinkPad P51.

This answer is based on https://askubuntu.com/a/423849/52975 by Nikolay Shmyrev with additions by me.

NERD dictation (uses the VOSK-API)

https://github.com/ideasman42/nerd-dictation and see also: https://unix.stackexchange.com/a/651454/32558

Benchmarks

https://github.com/Picovoice/speech-to-text-benchmark mentions a few:

It would be interesting to run VOSK against the other systems on those benchmarks, or to find existing results.

Ciro Santilli OurBigBook.com
  • Wow. I did a quick test with my own voice. Partially messy and not in my native tongue, Vosk did a better job than DeepSpeech out of the box. I'm impressed. What's the catch? :) – creativecoding Mar 21 '21 at 01:17
  • 1
    @creativecoding if someone tries to scam you, show them this file and fork ;-) – Ciro Santilli OurBigBook.com Mar 21 '21 at 08:20
  • 2
    The VOSK-API is excellent, but doesn't provide basic integration; try https://github.com/ideasman42/nerd-dictation - a utility to integrate it with PulseAudio and X11. – ideasman42 May 25 '21 at 17:45
  • @ideasman42 cool idea! Add a screenshot to the repo if there's a GUI! – Ciro Santilli OurBigBook.com May 25 '21 at 19:49
  • 1
    Added a video demo, linked from the repo. – ideasman42 Jun 01 '21 at 19:43
  • @ideasman42 ah nice, it actually backspace removes errors as it guesses. And with that beautiful accent, it's no wonder it understands you perfectly! :-) – Ciro Santilli OurBigBook.com Jun 01 '21 at 19:56
  • Just tested Vosk on my own voice then took a film and cut 3 minutes out with ffmpeg to mono wav as it likes that format. Very underwhelming results as not to say gobbledegook was produced very funny at times and reminiscent of autotranscription on YT but actually even worse; so to me not usable; so many thanx for info here but frankly it is still a long way off :] – shantiq Jun 16 '21 at 07:15
  • @shantiq thanks for the report, share the audio if you can. – Ciro Santilli OurBigBook.com Jun 16 '21 at 07:40
  • 1
    Hi Ciri here it is a 3mn wav and the srt obtained thru Vosk https://mega.nz/folder/BkgwlbaL#bEwX-i5Np1fpC6anZG_O8Q – shantiq Jun 17 '21 at 15:46
  • 3
    I write emails for a living basically, and have been a long-time user of Dragon, first directly in Windows for a few years, and then via Swype/KDE Connect (most-upvoted answer) for maybe 6 months. I tried VOSK today w/ big static daanzu model and found it to be about as good. Accuracy for ordinary English is super-high, with most errors of the picked-the-wrong-homophone variety. A few annoyances but Dragon also had a few annoyances. I miss punctuation but can probably hack that in somehow via nerd-dictation config. Nerd-dictation is convenient UI w/ Gnome keyboard bindings. Worth a try. – joseph_morris Jun 24 '21 at 00:39
  • Spent literally a day trying to install vosk and got nowhere, the documentation for installation is the most half-assed thing i've ever seen. The support seems like people going around in circles trying to guess the exact python version that they need to get it work. I tried to install and pip3 just refuses to do anything because it doesn't meet the requirements. I've even moved to a different version of python compiled it myself and everything, got absolutely nowhere with it. – Owl Feb 04 '22 at 00:48
  • @Owl do link to a bug report with all your system details if you can. Worse case, copy my exact setup, vosk 7af3e9a334fbb9557f2a41b97ba77b9745e120b3 in an Ubuntu 20.04 Docker, and then diff out with your setup. It worked easily for me, but I know I could have just gotten lucky. – Ciro Santilli OurBigBook.com Feb 04 '22 at 09:03
  • 1
    Thanks for the excellent explaination. Your answer helped me run it in 2 mins and it's working great! – supersan May 07 '22 at 12:30
28

Try nerd-dictation, a simple way to access the VOSK-API: a high-quality, offline, open-source speech-to-text engine that works with both X11 and Wayland.

See demo video.


Full disclosure: I couldn't find any solution that suited my use case, so I wrote this small utility to scratch my own itch.

ideasman42
  • 1
    This works great for me so far! I added the example script to use the start/stop phrases and then added it to my startup. Using it for working from home. – Ryan Hartman Nov 05 '21 at 19:33
  • 1
    Also use it working from home (might be a bit odd using it in an office :) ), although I managed to setup my keyboard (with QMK) so I can hold a key while speaking for dictation. – ideasman42 Nov 06 '21 at 02:00
26

Right now I'm experimenting with using KDE Connect in combination with Google speech recognition on my Android smartphone.

KDE Connect allows you to use your Android device as an input device for your Linux computer (it also has some other features). You need to install the KDE Connect app from the Google Play Store on your smartphone/tablet, and install both kdeconnect and indicator-kdeconnect on your Linux computer. On Ubuntu systems the installation goes as follows:

sudo add-apt-repository ppa:vikoadi/ppa
sudo apt update
sudo apt install kdeconnect indicator-kdeconnect

The downside of this installation is that it installs a bunch of KDE packages that you don't need if you don't use the KDE desktop environment.

Once you pair your Android device with your computer (they have to be on the same network), you can open the Android keyboard and press the mic icon to use Google speech recognition. As you talk, text will start to appear wherever your cursor is active on your Linux computer.

As for the results, they are a bit mixed for me, as I'm currently writing a technical astrophysics document and Google speech recognition struggles with jargon that one doesn't typically encounter. Also, forget about it figuring out punctuation or proper capitalization.


  • 21
    The problem with google is it's not text to speech, it sends it back to google. This is bad for privacy. – Owl Dec 12 '19 at 15:33
  • After struggling with audio-to-text utilities on Linux for a long time, I solved the problem with a trivial hack: just play the audio over my laptop speakers and put my phone next to it, with Google Docs in text-to-speech mode. Stupid but it worked :) – Resigned June 2023 Mar 07 '20 at 00:34
  • 4
    I am surprised that this is still the "best" answer, and continues to slowly accumulate votes. – shockburner Jan 13 '21 at 17:58
  • This screenshot actually shows Swype, which is Nuance (now owned by Microsoft), not Google voice typing. Google voice typing on Android (GBoard, and I think many "stock" keyboards include it) does not work with KDE Connect, as far as I can tell because KDE Connect asks the keyboard for single-press type input, rather than free-form text. This puts Gboard into a mode where voice typing is not available. See KDE bug 365305 https://bugs.kde.org/show_bug.cgi?id=365305 If someone finds Google voice typing that works with KDE Connect, please say how! – joseph_morris Apr 20 '21 at 18:42
  • 1
    @joseph_morris When I first posted this answer (4.5 years ago), it did work with GBoard. I have not tried it since then. The attached photos were added by the OP as I had insufficient reputation at the time to post photos. – shockburner Apr 21 '21 at 18:39
  • I wound up seeing if Google would recognize streams of expletives when I discovered that the "Voice Typing" features of Google Docs (when used with Chrome only) doesn't save the audio to one's Google account, which is a requirement of mine – Michael Nov 10 '21 at 17:14
  • Is Android hardware on-topic? – user598527 Jun 13 '22 at 06:47
  • it works ok enough but if your internet isnt slamin, youll be gettin network bottlenecks in no time – j0h Sep 12 '22 at 23:02
  • It doesn't work well. Effectively, it doesn't work at all. When I send some text from my mobile, the KDE Connect receives just a few symbols out of the initial text. It's an extremely weird technology. – Onkeltem Dec 10 '22 at 13:40
8

OpenAI's Whisper (MIT license, Python 3.9, CLI) yields highly accurate transcriptions. To use it (tested on Ubuntu 20.04 x64 LTS):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git 
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large
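Whisper also has a Python API whose `transcribe()` call returns a dict with a `segments` list (each segment carrying `start`, `end`, and `text`). A sketch that renders such segments as SRT, exercised here on a hand-made sample so it runs without the model:

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, e.g. 2.5 -> 00:00:02,500."""
    ms = round(seconds * 1000)
    h, rest = divmod(ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def segments_to_srt(segments):
    """Render a list of {start, end, text} dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append("%d\n%s --> %s\n%s"
                      % (i, srt_time(seg["start"]), srt_time(seg["end"]),
                         seg["text"].strip()))
    return "\n\n".join(blocks)

# With Whisper installed, segments come from its Python API:
#   import whisper
#   result = whisper.load_model("base").transcribe("recording.wav")
#   print(segments_to_srt(result["segments"]))

# Hand-made sample in the same shape as Whisper's output:
print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " And we must study"}]))
```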

If using an Nvidia 3090 GPU, add the following after conda activate whisperpy39:

pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch

Performance info below.

Model inference time:

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny         39 M     tiny.en              tiny                 ~1 GB           ~32x
base         74 M     base.en              base                 ~1 GB           ~16x
small       244 M     small.en             small                ~2 GB           ~6x
medium      769 M     medium.en            medium               ~5 GB           ~2x
large      1550 M     N/A                  large                ~10 GB          1x

WER on several corpora is charted in https://cdn.openai.com/papers/whisper.pdf.

WER on several languages is charted in https://github.com/openai/whisper/blob/main/language-breakdown.svg.

5

After trying Simon and Julius on Kubuntu, neither of which I was able to install properly, I stumbled on the idea of trying Mycroft, the open-source AI assistant (competing with Google Home and Amazon Alexa).

After the KDE Plasmoid install failed, I was able to get pretty good speech recognition going with the regular install. It has a mycroft-cli-client for viewing debugging messages and a somewhat active community forum. Some of the docs are a little out of date, but I have noted that on the forum and in GitHub where applicable.

The speech recognition is really pretty good, and you can install Mimic, a local speech synthesis engine. It is also cross-platform, and there is an Android app I haven't tried yet. My next step is to reproduce some of the basic desktop shortcut commands I was hoping for in the Plasmoid, and a dictation Skill for large text fields.

https://github.com/MycroftAI/mycroft-core

https://community.mycroft.ai/

alchemy
5

You might be interested in Numen, which is voice input for desktop computing without a keyboard or mouse. It's another project that uses the vosk-api speech recognition.

I'm the creator of Numen and you can find a short demonstration here.

geb
3

As one more Linux user searching for a useful speech-to-text (dictation) program, I took a look at speechpad.pw:

  • it recognizes my mother tongue very well
  • it works fast and very reliably

Downsides:

  • of course it is proprietary and closed software from Google
  • a Google service will listen to, process and supposedly store every word you speak
  • audio and text will be processed and obviously stored by Google
  • speechpad.pw requires a monthly / quarterly / yearly subscription fee
  • speechpad.pw only runs as an add-on to the Google Chrome browser - no other browser

So, speechpad.pw is very proprietary, closed source, and bound to Google, which we all know as a tireless collector of metadata, personal information, and personal content.

These downsides make it a no-go application for me though the speech recognition itself works very well - much better than anything else I have seen so far.

too
  • Thanks, yes significant downsides, especially that it only works in the Chrome browser. – Franck Dernoncourt Oct 28 '16 at 22:45
  • 2
    You could use Google Docs on Chrome and use their "Tools" » "Voice Typing ..." option. Probably the exact same speech recognition software, but it's free. Then copy-paste the results from your doc to wherever you need the text. – Alexis Wilke Nov 10 '17 at 20:19
3

I'm using the KDE Connect app.

It is working quite effectively! I am able to keep my eyes on the monitor while speaking with the phone on the desk.

The only downside is that this is done through the Google keyboard, which is neither free, native, nor open source.

3

I'd recommend Mozilla DeepSpeech. It's an open-source speech-to-text tool, but you will need to train it.

You can download the pre-trained model or use the Mozilla Common Voice datasets to create your own. For very clear recordings the accuracy rate is good. For my transcription projects it was still not sufficient, as the recordings had lots of background noise and were not of great quality.

I used Transcribear instead, a browser-based speech-to-text tool. You will need to be online to upload recordings to the Transcribear server.

John
2

The Chrome app "VoiceNote II" (http://voicenote.in/) is working great on my Xubuntu 16.04 machine. No voice training required, and setup was simple: one search to find it, one click to install, one click to create a shortcut and bind it to the desktop.

2

A post I created recently covers some of this information in a little more detail (credit to geb and adabru for some of the information below); it may be helpful to read, bookmark, and check back for updates: Eye Gaze Tracking With Head Tracking Solutions On Linux

According to adabru, https://handsfreecoding.org/, and many others I've come across online, one of the more productive and easier options to set up is https://talonvoice.com

It appears to work offline for analysing spoken words (see section 7, Privacy): https://talonvoice.com/EULA.txt

You can use the Vosk engine in Talon for other-language support if you pay $25/month (at the time of writing) for the Beta version (see Vosk and the Talon community wiki for supported languages):

https://alphacephei.com/vosk/

https://talon.wiki/speech_engines/

https://talon.wiki/faq/#are-languages-other-than-english-supported

There is also a free version of Talon, but keep in mind that Talon isn't entirely open source.

I would also give Numen a hard look. It's free and open-source software that uses Vosk, which supports other languages. It looks like a very good option if you primarily use keyboard-centric programs (some are listed in the link): https://git.sr.ht/%7Egeb/numen

1

I would suggest using Dragon on your phone or tablet, then emailing the text to yourself. It's a drag, but it works and is very accurate. If you insist on using Linux for this, getting a second display will make it much easier to copy and paste.

I haven't tried this, but you might be able to use or adapt the Python Bluetooth Chat program with Dragon on your tablet/phone. There may also be remote-keyboard apps for mobile devices that support dictation input.

I shall experiment and try to get back to you with something more definitive.

0

DeepSpeech

To install it:

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate

Install DeepSpeech

pip3 install deepspeech

Download pre-trained English model files

curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

Download example audio files

curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz

Transcribe an audio file

deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav

I recorded a verse of the Dhammapada, fed it to DeepSpeech, and it transcribed it with 100% accuracy.

Owl