Offline speech to text to trigger custom commands on Android with Kaldi and Vosk

Speech to text on Android usually means using the built-in speech recognizer, which connects to Google's cloud. Offline recognition is possible too, and, let's be honest, support for 120 languages is pretty impressive. However, I like to experiment, and I also want to be as consistent as possible when it comes to building and using open source, so I started looking for alternatives. I can't remember how long ago it was, but I was at least five or six pages deep into the search results when I stumbled onto Kaldi and Vosk.

The use case for speech to text for me is simple: voice commands. I'm currently building an app, nicknamed Solfidola. It's still in an early state [1], but the main goal is to create an app to learn Solfege. It's an interesting project from a technology point of view: rendering music sheets, placing notes at the right position in the bar, playing them with the right pitch and interval, and getting soundfonts working with the MIDI drivers. You can create exercises to test one or more intervals, and this is where I wanted to introduce the voice commands. Say you create an exercise with 3 intervals: you are presented with those 3 choices, and when you hear the interval, you can give the position of the solution by just saying one, two or three. Or you can say 'play' to hear the interval again. That's it, and thanks to Kaldi it works beautifully now!

The implementation

An example Android app is available at Kaldi Android Demo on GitHub, which ships with the Kaldi library and a small English model. The two most important methods are onPartialResult() and onResult() from the RecognitionListener interface, which deliver results from the STT engine. Each result comes back as a JSON string, so a bit of parsing is needed before you can take action.

    // Requires org.json.JSONObject and org.json.JSONException.
    @Override
    public void onResult(String s) {
        try {
            // The recognizer hands back its result as a JSON string.
            JSONObject o = new JSONObject(s);
            if (o.has("text")) {
                // We have a result, check if we can push a button.
                checkSolutionFromSpeech(o.getString("text"));
            }
        } catch (JSONException ignored) {
            // Malformed result: nothing to act on, so skip it.
        }
    }

The checkSolutionFromSpeech() method then checks whether the word that came back matches a solution and gives feedback on whether you are right or wrong. See the source for more inspiration for your own application.
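
To give an idea of what the matching step could look like, here is a minimal sketch in plain Java. It is not the code from the app: the class name VoiceCommandMatcher, the Command enum and the match() method are all made up for illustration, and the vocabulary only covers the four words mentioned earlier (one, two, three and play). It assumes the recognized text arrives as a lowercase, space-separated string and simply picks the first word it knows.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not from the app): maps recognized text to an action. */
public class VoiceCommandMatcher {

    /** Illustrative actions for the exercise screen. */
    public enum Command { ANSWER_ONE, ANSWER_TWO, ANSWER_THREE, PLAY, NONE }

    // Assumed vocabulary: the four words mentioned in this post.
    private static final Map<String, Command> VOCABULARY = new HashMap<>();
    static {
        VOCABULARY.put("one", Command.ANSWER_ONE);
        VOCABULARY.put("two", Command.ANSWER_TWO);
        VOCABULARY.put("three", Command.ANSWER_THREE);
        VOCABULARY.put("play", Command.PLAY);
    }

    /**
     * Returns the command for the first known word in the recognized text,
     * or NONE when the text is null, empty, or contains no known word.
     */
    public static Command match(String recognizedText) {
        if (recognizedText == null) {
            return Command.NONE;
        }
        String normalized = recognizedText.trim().toLowerCase(Locale.ROOT);
        for (String word : normalized.split("\\s+")) {
            Command command = VOCABULARY.get(word);
            if (command != null) {
                return command;
            }
        }
        return Command.NONE;
    }
}
```

In a setup like this, onResult() would pass the "text" value to VoiceCommandMatcher.match() and then press the matching button or replay the interval. Taking the first known word is a deliberately simple choice; depending on how noisy the results are, you might prefer to accept only results that consist of exactly one known word.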

Making the model smaller

The English [2] language model is relatively large, around 50MB. Since I only need a few words to be recognized, I wondered whether it was possible to make it smaller. Turns out it's not that hard once you have Kaldi installed on your machine. Tip: if you install Kaldi from source, only compile the tools folder. That's all you need to shrink the model down to the words you need.

The steps are described in the Vosk adaptation document on GitHub. I created a 'text.txt' file, which contains about 15 words at the moment, and ran the commands. This saves me about 20MB of storage for the app, which is a great deal!

I'd like to take this a step further and figure out whether it's possible to save more storage in either the model or the Kaldi AAR library. But, first things first, I need to practice my interval recognition now :)

Footnotes

1. If you're interested in testing, contact me and I can add you to the alpha program!
2. Models for other languages are available at Vosk models.