TWB's REST API for serving automatic speech recognition (ASR) models.
Features:
- Loads and runs multiple models in parallel
- Supports Kaldi or DeepSpeech-based models
- Works on CPU
- Takes in any type of audio file
- Model specifications through a JSON-based configuration file
- Permanent or per-request vocabulary specification (with Kaldi-based models)
- Word timing information (with Kaldi-based models)
- (NEW) Per-request language model selection (with DeepSpeech-based models)
- Kaldi-based models can be found on Alpha Cephei's website.
- DeepSpeech-based models can be found in Coqui's model zoo.
- Place the model you want into a folder under the `models` directory.
- Specify it in the configuration file.

An example `models` directory layout:
```
models
├── my-kaldi-model
│   ├── README
│   ├── am
│   │   └── final.mdl
│   ├── conf
│   │   ├── mfcc.conf
│   │   └── model.conf
│   ├── graph
│   │   ├── Gr.fst
│   │   ├── HCLr.fst
│   │   ├── disambig_tid.int
│   │   └── phones
│   │       └── word_boundary.int
│   └── ivector
│       ├── final.dubm
│       ├── final.ie
│       ├── final.mat
│       ├── global_cmvn.stats
│       ├── online_cmvn.conf
│       └── splice.conf
└── my-deepspeech-model
    ├── general.scorer
    ├── yes-no.scorer
    ├── digits.scorer
    └── my-deepspeech-model.tflite
```
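For example, a small English VOSK model could be fetched and unpacked into place like this (a sketch; the model name and URL are illustrative, pick any model from the sources above):

```bash
# Download a Kaldi-based VOSK model and extract it under models/
# (model name and URL are illustrative)
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip -d models/
```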
Model configurations are specified in a JSON file named `config.json`. An example configuration file looks like this:
```json
{
    "languages": {"<lang-code>": "<language-name>", "en": "English", "bn": "Bengali"},
    "models": [
        {
            "lang": "<lang-code>",
            "alt": "<optional alternative tag>",
            "model_type": "<vosk or deepspeech>",
            "model_path": "<model directory>",
            "vocabulary": "<vocabulary file path (only for vosk type models)>",
            "scorers": {
                "default": "<default scorer path (only for deepspeech type models)>",
                "<scorer-id>": "<alternative scorer path (only for deepspeech type models)>"
            },
            "load": <true or false for loading at runtime>
        }
    ]
}
```
- `languages`: a dictionary used for mapping language codes to language names.
- `models`: a list containing the specifications of the models available in the system.
- `lang`: language code of the model. This will be the main model label used in API calls.
- `alt`: an optional extra label for the model. For example, if you have alternative models for a language, use this tag so that the system can differentiate between them.
- `model_type`: type of the model: `vosk` if Kaldi-based, `deepspeech` if DeepSpeech-based.
- `model_path`: directory where the model files are stored. This directory should be placed under the `models` directory.
- `vocabulary`: an optional text file containing the words that the ASR will be conditioned to recognize (works only with vosk type models).
- `scorers`: a dictionary of scorer ids and their paths inside the model directory (works only with deepspeech type models).
- `load`: if set to false, the model will be skipped during loading.
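As a concrete sketch, a complete `config.json` for the example directory layout above might look like this (the language code, `alt` tag and scorer ids are illustrative):

```json
{
    "languages": {"en": "English"},
    "models": [
        {
            "lang": "en",
            "model_type": "vosk",
            "model_path": "my-kaldi-model",
            "load": true
        },
        {
            "lang": "en",
            "alt": "coqui",
            "model_type": "deepspeech",
            "model_path": "my-deepspeech-model",
            "scorers": {
                "default": "general.scorer",
                "digits": "digits.scorer",
                "yes-no": "yes-no.scorer"
            },
            "load": true
        }
    ]
}
```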
If your application works in a restricted domain, you can specify a vocabulary file. To do that, make a text file containing, one per line, all the words that can possibly be recognized, place it under the `vocabularies` folder and specify the filename in the `vocabulary` field of the model specification. (This feature works only for Kaldi-based models.)
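For instance, a minimal yes/no vocabulary could be created like this (the filename and contents are illustrative) and then referenced with `"vocabulary": "yes-no.txt"` in the model specification:

```bash
# Create a restricted two-word vocabulary, one word per line
# (filename and contents are illustrative)
mkdir -p vocabularies
printf 'yes\nno\n' > vocabularies/yes-no.txt
```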
Speech recognition is conditioned by what's called a language model. You can improve recognition accuracy by optimizing the language model for your task. For example, if you need to recognize digits, it's better to use a language model trained only on text containing digits.

ASR-API allows using the same DeepSpeech-based acoustic model with multiple language models. To do that, just place them in the model directory and specify their ids and paths under the `scorers` dictionary.
Let's say we want to build a lightweight API that recognizes the numbers 0 to 9. What we should do is:
- Download or clone this repository (`git clone https://github.com/translatorswb/ASR-API.git`)
- Download the lightweight English model from VOSK
- Extract its contents into the `models` directory
- Create a vocabulary file at `vocabularies/english-digits.txt` with the following content:
```
zero
one
two
three
four
five
six
seven
eight
nine
```
- Add the model specification to `config.json`:
```json
{
    "languages": {"en": "English"},
    "models": [
        {
            "lang": "en",
            "alt": "digits",
            "model_type": "vosk",
            "model_path": "vosk-model-small-en-us-0.15",
            "vocabulary": "english-digits.txt"
        }
    ]
}
```
Set the environment variables:

```bash
export MT_API_CONFIG=config.json
export MODELS_ROOT=models
export VOCABS_ROOT=vocabularies
```
Install the required libraries:

```bash
pip install -r requirements.txt
```
Run with uvicorn:

```bash
uvicorn app.main:app --reload --port 8010
```
The `run_local.sh` script can also be called to run the API quickly once the requirements are installed.
Alternatively, build and run with docker-compose:

```bash
docker-compose build
docker-compose up
```
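Once the server is up, you can check that it's responding by listing the available languages (this endpoint is described in detail below):

```bash
curl -L -X GET 'http://localhost:8010/transcribe'
```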
Transcription requests take in an audio file and respond with its transcription.
```bash
curl -L -X POST 'http://localhost:8010/transcribe/short' -F 'file=@"my_audio.wav"' -F 'lang="en"'
```
{ "transcript": "good day madam" , "time":1.204 }
Word timing information can be obtained by setting the `word_times` flag to `True` in the request. This feature currently works only with vosk models.
```bash
curl -L -X POST 'http://localhost:8010/transcribe/short' -F 'file=@"my_audio.wav"' -F 'lang="en"' -F 'word_times="True"' -F 'alt="digits"'
```
```json
{
    "words": [
        {
            "conf": 1.0,
            "end": 1.14,
            "start": 0.6,
            "word": "one"
        },
        {
            "conf": 1.0,
            "end": 1.89,
            "start": 1.35,
            "word": "three"
        },
        {
            "conf": 1.0,
            "end": 2.58,
            "start": 2.1,
            "word": "one"
        },
        {
            "conf": 1.0,
            "end": 3.39,
            "start": 2.97,
            "word": "two"
        }
    ],
    "transcript": "one three one two",
    "time": 0.980
}
```
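If you want to post-process these timings on the command line, a small sketch with `jq` (assuming it is installed) pulls out just the recognized words:

```bash
# Request word timings and extract the recognized words from the JSON response
curl -s -L -X POST 'http://localhost:8010/transcribe/short' \
     -F 'file=@"my_audio.wav"' -F 'lang="en"' -F 'word_times="True"' \
  | jq -r '.words[].word'
```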
You can restrict the model to recognize only certain words during a request. To do that, pass the list of words you want to restrict it to in the `vocabulary` request field. (This feature works only for Kaldi-based models.)
```bash
curl -L -X POST 'http://localhost:8010/transcribe/short' -F 'file=@"my_audio.mp3"' -F 'lang="en"' -F 'vocabulary="[\"yes\", \"no\"]"'
```
```json
{
    "transcript": "yes",
    "time": 0.152
}
```
You can specify which language model (scorer) to use on a per-request basis for DeepSpeech-based models. To do that, pass the scorer id you used in the configuration file in the `scorer` field. If no scorer is specified in the request, the scorer with the `default` id will be selected. If there's no scorer with the `default` id, the model will run without a language model.
```bash
curl -L -X POST 'http://localhost:8010/transcribe/short' -F 'file=@"my_audio.mp3"' -F 'lang="en"' -F 'scorer="digits"'
```
```json
{
    "transcript": "one",
    "time": 0.121
}
```
A GET request to `/transcribe` retrieves the languages supported by the API.
```bash
curl -L -X GET 'http://localhost:8010/transcribe'
```
```json
{
    "languages": {
        "en": {
            "name": "English",
            "scorers": []
        },
        "en-digits": {
            "name": "English (digits)",
            "scorers": []
        },
        "bn": {
            "name": "Bengali",
            "scorers": [
                "default",
                "glossary"
            ]
        }
    }
}
```