Eng. Oscar Alonso Rosete Beas
[Figure: voice assistant pipeline with four numbered stages (1-4). Labels: Wake word detection, Speech recognition (speech-to-text), Natural language understanding and skills, Speech Synthesis, Coordinator, Text, speech-to-intent.]
speech-to-text
Vosk is a speech recognition toolkit. Its main strengths are:
Supports 17 languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino.
Works offline, even on lightweight devices - Raspberry Pi, Android, iOS
Docs:
https://alphacephei.com/vosk/
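As a rough illustration of the offline speech-to-text step, the sketch below transcribes a WAV file with the Vosk Python bindings. The model directory and WAV path are assumptions (a model must be downloaded separately), and the helper names `extract_text` and `transcribe` are made up for this example.

```python
import json
import wave


def extract_text(result_json: str) -> str:
    """Return the transcript from a Vosk result, which is a JSON string."""
    return json.loads(result_json).get("text", "")


def transcribe(wav_path: str, model_dir: str) -> str:
    """Transcribe a 16-bit mono PCM WAV file offline (paths are assumptions)."""
    # Imported lazily so extract_text works even without Vosk installed.
    from vosk import KaldiRecognizer, Model

    model = Model(model_dir)
    pieces = []
    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(model, wav.getframerate())
        while True:
            data = wav.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True once an utterance is complete.
            if recognizer.AcceptWaveform(data):
                pieces.append(extract_text(recognizer.Result()))
        pieces.append(extract_text(recognizer.FinalResult()))
    return " ".join(piece for piece in pieces if piece)
```

A call such as `transcribe("command.wav", "model")` would return the recognized text as a single string, assuming both paths exist.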
Rhasspy is composed of independent services that coordinate over MQTT using a superset of the Hermes protocol.
Docs:
https://rhasspy.readthedocs.io/en/latest/#services
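To make the MQTT coordination more concrete, here is a minimal sketch of how a skill might read an intent message and answer through the text-to-speech service. The topic names (`hermes/intent/...`, `hermes/tts/say`) and JSON fields follow the Hermes convention as we understand it and should be checked against the Rhasspy reference; the function names are hypothetical, and delivering the messages is left to whichever MQTT client is used.

```python
import json

INTENT_TOPIC_PREFIX = "hermes/intent/"  # assumed Hermes topic prefix


def parse_intent_message(topic: str, payload: bytes):
    """Return (intent_name, slots) for an intent message, or None otherwise."""
    if not topic.startswith(INTENT_TOPIC_PREFIX):
        return None
    message = json.loads(payload)
    # Fall back to the topic suffix if the payload carries no intent name.
    fallback_name = topic[len(INTENT_TOPIC_PREFIX):]
    intent_name = message.get("intent", {}).get("intentName", fallback_name)
    slots = {
        slot["slotName"]: slot["value"]["value"]
        for slot in message.get("slots", [])
    }
    return intent_name, slots


def build_say_message(text: str, site_id: str = "default"):
    """Return (topic, payload) asking the speech synthesis service to speak."""
    return "hermes/tts/say", json.dumps({"text": text, "siteId": site_id})
```

Keeping the parsing separate from the MQTT client makes each service easy to test in isolation, which fits the independent-services design.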
Picovoice is an end-to-end platform for building voice products. Unlike Alexa and Google services, Picovoice runs entirely on-device and, according to its vendor, is more accurate. With Picovoice, a user's intent can be inferred from a naturally spoken utterance such as:
Hey Edison, set the lights in the living room to blue.
Docs:
https://picovoice.ai/docs/
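The sketch below shows what a speech-to-intent result for the utterance above might look like once it reaches application code. It is not the Picovoice SDK: the `Inference` class is a hypothetical stand-in, and the intent name `changeColor` and the slot names `location` and `color` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Inference:
    """Hypothetical stand-in for the output of a speech-to-intent engine."""
    is_understood: bool
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def handle(inference: Inference) -> str:
    """Map an inferred intent to an action description."""
    if not inference.is_understood:
        return "Sorry, I did not understand."
    if inference.intent == "changeColor":
        location = inference.slots["location"]
        color = inference.slots["color"]
        return f"Setting {location} lights to {color}"
    return f"No handler for intent {inference.intent!r}"


# "Hey Edison, set the lights in the living room to blue."
example = Inference(True, "changeColor",
                    {"location": "living room", "color": "blue"})
print(handle(example))
```

The point of speech-to-intent is that the application receives this structured result directly, without an intermediate transcript to parse.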
Tech | Owner | License | Connectivity | Wake word | Speech-to-text | Speech-to-intent | Speech synthesis |
---|---|---|---|---|---|---|---|
Google recognition | Google | - | Requires internet | Yes | Yes | Yes | Yes |
Deep Speech | Mozilla | Open source | Offline | No | Yes | No | Yes |
Picovoice | Picovoice Inc. | Free personal usage | Offline | Yes | Yes | Yes | Yes |
Vosk | Alpha Cephei | Open source | Offline | No | Yes | No | No |
Snowboy | KITT.AI (discontinued) | Open source | Offline | Yes | No | No | No |
Mycroft | Mycroft Community | Open source | Offline | Yes | Yes | No | Yes |
Rhasspy | Community | Open source | Offline | Yes | Yes | Yes | Yes |
Among the evaluated tools, three are promising: Google's speech recognition, Picovoice, and the Rhasspy project.
Given the offline usage requirement, the most promising ones are Picovoice and Rhasspy, although Picovoice is not free for commercial use and Rhasspy is intended for Linux computers.
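The selection above can be reproduced mechanically from the comparison table. The sketch below encodes each tool's capabilities as read from the table (the Google row is taken to mean "requires internet, all four capabilities present") and applies the two filters; the class and field names are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    offline: bool
    wake_word: bool
    speech_to_text: bool
    speech_to_intent: bool
    speech_synthesis: bool

    def covers_full_pipeline(self) -> bool:
        """True when all four pipeline stages are supported."""
        return all((self.wake_word, self.speech_to_text,
                    self.speech_to_intent, self.speech_synthesis))


TOOLS = [
    Tool("Google recognition", False, True, True, True, True),
    Tool("Deep Speech", True, False, True, False, True),
    Tool("Picovoice", True, True, True, True, True),
    Tool("Vosk", True, False, True, False, False),
    Tool("Snowboy", True, True, False, False, False),
    Tool("Mycroft", True, True, True, False, True),
    Tool("Rhasspy", True, True, True, True, True),
]

# Tools that cover every stage, then the subset that also works offline.
promising = [t.name for t in TOOLS if t.covers_full_pipeline()]
offline_candidates = [t.name for t in TOOLS
                      if t.covers_full_pipeline() and t.offline]

print(promising)
print(offline_candidates)
```

Licensing and platform constraints are not modeled here; they still have to be weighed separately.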
Further research could cover building our own wake word detection and implementing it through the Vosk API or Deep Speech.
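One possible starting point for a custom wake word, sketched under assumptions: Vosk's `KaldiRecognizer` accepts an optional JSON grammar that restricts the vocabulary, so the recognizer can be limited to the wake phrase plus an unknown-word token. This is a keyword-spotting workaround rather than a dedicated wake word engine, the grammar option is only expected to work with models that support it, and the phrase, model directory, and function names are placeholders.

```python
import json


def heard_wake_phrase(result_json: str, wake_phrase: str) -> bool:
    """Check a Vosk final or partial result (a JSON string) for the phrase."""
    result = json.loads(result_json)
    text = result.get("text") or result.get("partial") or ""
    return wake_phrase in text


def make_wake_recognizer(model_dir: str, wake_phrase: str,
                         sample_rate: int = 16000):
    """Build a recognizer whose vocabulary is limited to the wake phrase."""
    # Imported lazily so heard_wake_phrase works without Vosk installed.
    from vosk import KaldiRecognizer, Model

    # "[unk]" absorbs everything that is not the wake phrase.
    grammar = json.dumps([wake_phrase, "[unk]"])
    return KaldiRecognizer(Model(model_dir), sample_rate, grammar)
```

In a listening loop, audio chunks would be fed to the recognizer and each `PartialResult()` or `Result()` string passed to `heard_wake_phrase`.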
The total number of hours required for in-house development is detailed in the following slides.
The following hours of research are required to ensure compatibility of Rhasspy with Android.
Rhasspy currently supports Raspberry Pi 2-3 B/B+ (armhf/aarch64), desktop/laptop/server (amd64), and Raspberry Pi Zero (armv6l).
https://rhasspy.readthedocs.io/en/latest/hardware/
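A quick way to check whether a given machine falls into one of the supported architectures is sketched below. The mapping from the names reported by `platform.machine()` to the architecture labels on the hardware page is our assumption (for example, a 32-bit Raspberry Pi OS on a Pi 2 or 3 typically reports `armv7l`, which corresponds to armhf), and `rhasspy_arch` is a hypothetical helper.

```python
import platform
from typing import Optional

# Assumed mapping: machine name reported by the OS -> architecture label.
SUPPORTED_ARCHITECTURES = {
    "armv7l": "armhf",
    "aarch64": "aarch64",
    "x86_64": "amd64",
    "amd64": "amd64",
    "armv6l": "armv6l",
}


def rhasspy_arch(machine: str) -> Optional[str]:
    """Return the matching architecture label, or None if unsupported."""
    return SUPPORTED_ARCHITECTURES.get(machine.lower())


print(rhasspy_arch(platform.machine()))
```

Android devices are mostly ARM-based, but the operating system differs from the Linux targets above, which is why the compatibility research is needed.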
This work is estimated to require a total effort of 40 hours.
Android demo:
https://github.com/razzo04/rhasspy-mobile-app
The following hours of research are required to use the Vosk API or Mozilla Deep Speech.
Testing voice2json, Pocketsphinx, and Snowboy to make sure they are compatible with the other well-known APIs is estimated to require 40 hours.