Creates a Kaldi nnet3 recipe from transcribed audio using the International Phonetic Alphabet for word pronunciations. Unknown words have pronunciations predicted with phonetisaurus.
This project is inspired by Zamia Speech, and is intended to supply acoustic models built from open speech corpora to the Rhasspy project for many human languages.
Check out the pre-trained models.
- Python 3.7 or higher
- CUDA and cuDNN
    - See installing CUDA
- Kaldi compiled with support for CUDA
    - Install CUDA/cuDNN before compiling Kaldi
    - See installing Kaldi
    - Tested on Ubuntu 18.04 (bionic) with CUDA 10.2 and cuDNN 7.6
- gruut
    - Used to generate IPA word pronunciations
ipa2kaldi does not automatically download or unpack audio datasets for you. A dataset is expected to exist in a single directory with:
- A metadata.csv file
    - Delimiter is | and there is no header
    - Either id|text (needs the --speaker argument) or id|speaker|text
    - Corresponding WAV file must be named <id>.wav
- WAV files in 16 kHz 16-bit mono PCM format
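The layout above can be checked before running ipa2kaldi. The following is a minimal sketch using only the Python standard library; check_dataset is a hypothetical helper written for this example and is not part of ipa2kaldi.

```python
import csv
import wave
from pathlib import Path


def check_dataset(dataset_dir):
    """Return a list of problems found in a dataset directory (empty if none).

    Hypothetical helper for illustration; not part of ipa2kaldi.
    """
    dataset_dir = Path(dataset_dir)
    problems = []

    with open(dataset_dir / "metadata.csv", encoding="utf-8", newline="") as f:
        # "|" is the delimiter and there is no header row.
        # QUOTE_NONE keeps quotation marks in transcriptions as literal text.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if not row:
                continue  # skip blank lines
            if len(row) not in (2, 3):
                problems.append(f"bad row: {row}")
                continue

            # The id is always the first column: id|text or id|speaker|text
            wav_path = dataset_dir / f"{row[0]}.wav"
            if not wav_path.is_file():
                problems.append(f"missing WAV: {wav_path.name}")
                continue

            try:
                with wave.open(str(wav_path), "rb") as wav:
                    fmt = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
            except wave.Error as e:
                problems.append(f"unreadable WAV: {wav_path.name} ({e})")
                continue

            # 16 kHz, 2 bytes per sample (16-bit), 1 channel (mono)
            if fmt != (16000, 2, 1):
                problems.append(f"wrong format: {wav_path.name} {fmt}")

    return problems
```

Calling check_dataset("/path/to/dataset1") returns an empty list when every row has a matching WAV file in the expected format.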
Download the source code and create the Python virtual environment:
$ pip install ipa2kaldi
For Raspberry Pi (ARM), you will first need to install phonetisaurus manually.
$ python3 -m ipa2kaldi /path/to/kaldi/egs/<model_name>/s5 \
--language <language-code> \
--dataset /path/to/dataset1 \
--dataset /path/to/dataset2
where:
- <model_name> is a name you choose
- <language-code> is a supported language from gruut, like en-us
If all goes well, you should now have a Kaldi recipe directory under egs/<model_name>/s5.
Before training, you must place a gzipped ARPA language model at egs/<model_name>/s5/lm/lm.arpa.gz.
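If you already have an uncompressed ARPA file from a language modeling toolkit of your choice, a sketch like the one below could put it in place. The function name install_language_model and its arguments are made up for this example; only the target path lm/lm.arpa.gz comes from the recipe layout.

```python
import gzip
import shutil
from pathlib import Path


def install_language_model(arpa_path, recipe_dir):
    """Gzip a plain-text ARPA file into <recipe_dir>/lm/lm.arpa.gz.

    Hypothetical helper for illustration; not part of ipa2kaldi.
    """
    lm_dir = Path(recipe_dir) / "lm"
    lm_dir.mkdir(parents=True, exist_ok=True)
    target = lm_dir / "lm.arpa.gz"

    # Stream the file through gzip so large models are never held in memory
    with open(arpa_path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

    return target
```

For example, install_language_model("my_lm.arpa", "/path/to/kaldi/egs/<model_name>/s5") would write the compressed model where run.sh expects it.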
After that, run:
$ cd /path/to/kaldi/egs/<model_name>/s5
$ ./run.sh
This will train a new TDNN nnet3 model in the recipe directory. It can take a day or two, depending on how powerful your computer is. If a particular training stage fails (see run.sh), you can resume with ./run.sh --stage N, where N is the stage to start at.
The typical training workflow is described below.
- Training transcriptions are tokenized and cleaned using gruut
- Vocabulary words looked up in IPA lexicon(s)
- Unknown words have pronunciations guessed with phonetisaurus model trained on IPA lexicon(s)
- Lexicon is created from generated/pre-built pronunciations
    - Use <unk> for unknown word
    - Use SPN (spoken noise) silence phoneme for <unk>
- Kaldi recipe files are generated
    - Non-silence phones are manually grouped for extra_questions.txt
    - SIL, SPN, NSN silence phones
        - SIL is optional
- Kaldi test/train files are generated
    - 10%/90% data split
    - wav.scp, text, and utt2spk
- Do Kaldi training with run.sh script
    - Prepares dict/lang directories
    - Adapts language model for Kaldi
    - Creates MFCC features
    - Trains monophone system
    - Trains triphone system (1b)
    - Trains triphone system (2b)
    - Generates iVectors
    - Generates topology
    - Gets alignment lattices
    - Builds tree
    - Trains TDNN 250 nnet3 model
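The 10%/90% test/train split in the workflow above could look roughly like the sketch below. How ipa2kaldi actually selects test utterances is not described here, so the seeded random shuffle, the function name split_utterances, and its parameters are all assumptions made for illustration.

```python
import random


def split_utterances(utt_ids, test_fraction=0.1, seed=0):
    """Return (train, test) lists of utterance ids.

    Hypothetical helper: a seeded shuffle is assumed, not taken from ipa2kaldi.
    """
    # Sort first so the result depends only on the ids and the seed
    ids = sorted(utt_ids)
    random.Random(seed).shuffle(ids)

    # Keep at least one test utterance when there is any data at all
    num_test = max(1, round(len(ids) * test_fraction)) if ids else 0
    test, train = ids[:num_test], ids[num_test:]

    return sorted(train), sorted(test)
```

With 100 utterance ids and the default test_fraction, this yields 90 training ids and 10 test ids, and the same seed always produces the same split.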
The output of this project is a Kaldi recipe that lives inside your Kaldi egs directory, such as /path/to/kaldi/egs/rhasspy_nnet3_en-us/s5. When scripts/doit.sh succeeds, this directory should contain the following files:
- s5/
    - run.sh
    - export.sh
    - data/
        - conf/
            - mfcc.conf
            - mfcc_hires.conf
            - online_cmvn.conf
        - local/
            - dict/
                - lexicon.txt.gz
                    - WORD P1 P2 ...
                - nonsilence_phones.txt
                    - Actual phonemes
                - silence_phones.txt
                    - SIL
                    - SPN
                    - NSN
                - optional_silence.txt
                    - SIL
                - extra_questions.txt
                    - Phones grouped by accents/elongation
        - train/
            - wav.scp
                - UTT_ID /path/to/wav
                - Sorted by UTT_ID
            - utt2spk
                - UTT_ID speaker
                - Sorted by UTT_ID, then speaker
            - text
                - UTT_ID transcription
        - test/
            - wav.scp
                - Same as train
            - utt2spk
                - Same as train
            - text
                - Same as train
    - lm/
        - lm.arpa.gz
            - ARPA language model
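To make the train/ and test/ file formats concrete, here is a small sketch that writes wav.scp, utt2spk, and text in the line formats listed above. The function write_kaldi_data_dir and the tuple layout of its input are hypothetical; ipa2kaldi generates these files itself, so this is only meant to show what they contain.

```python
from pathlib import Path


def write_kaldi_data_dir(data_dir, utterances):
    """Write wav.scp, utt2spk and text from (utt_id, speaker, wav_path, text) tuples.

    Hypothetical helper for illustration; not part of ipa2kaldi.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Sort by UTT_ID, then speaker. Python compares str by code point,
    # which is the same order as a byte-wise sort of the UTF-8 text.
    rows = sorted(utterances, key=lambda u: (u[0], u[1]))

    with open(data_dir / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(data_dir / "utt2spk", "w", encoding="utf-8") as utt2spk, \
         open(data_dir / "text", "w", encoding="utf-8") as text:
        for utt_id, speaker, wav_path, transcription in rows:
            print(utt_id, wav_path, file=wav_scp)    # UTT_ID /path/to/wav
            print(utt_id, speaker, file=utt2spk)     # UTT_ID speaker
            print(utt_id, transcription, file=text)  # UTT_ID transcription
```

Each file ends up with one line per utterance, and all three share the same UTT_ID ordering.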
Below are summarized instructions from this Medium article for Ubuntu 18.04 (bionic) with CUDA 10.2 and cuDNN 7.6.
First, add the CUDA repos:
$ sudo apt update
$ sudo add-apt-repository ppa:graphics-drivers
$ sudo apt-key adv --fetch-keys 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub'
$ sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" > /etc/apt/sources.list.d/cuda.list'
$ sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/ /" > /etc/apt/sources.list.d/cuda_learn.list'
Next, install CUDA and cuDNN:
$ sudo apt update
$ sudo apt install cuda-10-2
$ sudo apt install libcudnn7
If installation succeeds, add the following text to ~/.profile:
# set PATH for cuda 10.2 installation in ~/.profile
if [ -d "/usr/local/cuda-10.2/bin/" ]; then
    export PATH=/usr/local/cuda-10.2/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
After rebooting, check if everything works by running nvidia-smi and verifying the version of CUDA reported.
Install dependencies:
$ sudo apt-get update
$ sudo apt-get install \
build-essential \
wget curl ca-certificates \
libatlas-base-dev libatlas3-base gfortran \
automake autoconf unzip sox libtool subversion \
python3 python \
git zlib1g-dev patchelf rsync
Download the Kaldi source code:
$ git clone https://github.com/kaldi-asr/kaldi.git
Build dependencies (replace -j8 with -j4 if you have fewer CPU cores):
$ cd kaldi/tools
$ make -j8
Build Kaldi itself (replace -j8 with -j4 if you have fewer CPU cores):
$ cd ../src
$ ./configure --use-cuda --shared --mathlib=ATLAS
$ make depend -j8
$ make -j8
See the getting started guide if you have problems.
The following nnet3 models have been trained with ipa2kaldi using public speech data:
- Czech
- French
- Italian
- Spanish
These models are intended to be used with rhasspy-asr-kaldi from the Rhasspy voice assistant.