STT 1.4.0 - CommonVoice 12 #168

Open · wants to merge 8 commits into base: `master`
119 changes: 119 additions & 0 deletions STT/CONTRIBUTING.md
# Dockerfile for producing a French model

## Licensing:

This model is available under the terms of the MPL 2.0 (see `LICENSE.txt`).

## Prerequisites:

* Ensure you have a working setup of [`Docker` with GPU support](https://docs.docker.com/config/containers/resource_constraints/#gpu).
* Prepare a host directory with enough space (>=400GB) for training and producing intermediate data.
* Ensure it is writable by the `trainer` user (uid 999 by default, as defined in the Dockerfile).
* For the Common Voice dataset, please make sure you have downloaded it prior to running (the download requires an email address).
  Place `cv-corpus-*-fr` inside your host directory, in a `sources/` subdirectory.
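The steps above can be sketched as follows (the directory path is an assumption, adjust it to your setup; the `chown` to uid 999 needs root, so it is left commented out):

```shell
# Pick a location with >=400GB free; the path below is illustrative
DATA_DIR="${DATA_DIR:-$PWD/stt-data}"
mkdir -p "${DATA_DIR}/sources"
# Place the downloaded cv-corpus-*-fr archive under ${DATA_DIR}/sources/,
# then make the tree writable by the `trainer` user (uid 999):
# sudo chown -R 999:999 "${DATA_DIR}"
ls -d "${DATA_DIR}/sources"
```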

## Build the image:

```
$ docker build [--build-arg ARG=val] -f Dockerfile.train -t commonvoice-fr .
```

Several parameters can be customized:
- `stt_repo` to fetch STT from a different repo than upstream
- `stt_branch` to checkout a specific branch / commit
- `stt_sha1` commit to pull from when installing pre-built binaries
- `kenlm_repo`, `kenlm_branch` for the same parameters for KenLM
- `english_compatible` set to 1 if you want the importers to run in
  "English-compatible mode": this affects behavior such as re-using the
  English alphabet file, for example when doing transfer learning from
  English checkpoints.
- `lm_evaluate_range`, if non-empty, performs an LM alpha/beta evaluation.
  The parameter is expected to be of the form `lm_alpha_max`,`lm_beta_max`,`n_trials`.
  See upstream `lm_optimizer.py` for details.
- `lm_add_excluded_max_sec` set to 1 adds sentences that were excluded for being too long to the language model.
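To illustrate the expected `lm_evaluate_range` format, the comma-separated triplet can be split like this (the values are made up):

```shell
# Example triplet: lm_alpha_max=0.9, lm_beta_max=4.0, n_trials=600
LM_EVALUATE_RANGE="0.9,4.0,600"
IFS=',' read -r lm_alpha_max lm_beta_max n_trials <<< "${LM_EVALUATE_RANGE}"
echo "alpha_max=${lm_alpha_max} beta_max=${lm_beta_max} trials=${n_trials}"
```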

Some parameters for the model itself:
- `train_batch_size` to specify the batch size for training dataset
- `dev_batch_size` to specify the batch size for dev dataset
- `test_batch_size` to specify the batch size for test dataset
- `epoch` to specify the number of epochs to run training for
- `learning_rate` to define the learning rate of the network
- `dropout` to define the dropout applied
- `lm_alpha`, `lm_beta` to control language model alpha and beta parameters
- `amp` to enable or disable automatic mixed precision
- `skip_batch_test` set to 1 to skip the batch test entirely
- `duplicate_sentence_count` to control whether the Common Voice dataset should
  be regenerated with more duplicates allowed using Corpora Creator
  **USE WITH CAUTION**
- `enable_augments` to help the model generalize better on noisy data by augmenting the data in various ways.
- `augmentation_arguments` to set the `augments_file` path providing the augmentation parameters.
- `augments.txt`: the `augments_file` containing the arguments used for data augmentation when `enable_augments` is set to 1.
- `cv_personal_first_url` to download only your own voice instead of all Common Voice dataset (first url).
- `cv_personal_second_url` to download only your own voice instead of all Common Voice dataset (second url).
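For example, a build overriding a few of the parameters above could look like this (the values are illustrative, not recommendations; all three `--build-arg` names exist in `Dockerfile.train`):

```
$ docker build \
    --build-arg train_batch_size=96 \
    --build-arg epochs=30 \
    --build-arg amp=1 \
    -f Dockerfile.train -t commonvoice-fr .
```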

Language-specific files need to live under a language directory. Have a look at `fr/` for an example:
- `importers.sh`: script to run all the importers
- `metadata.sh`: script exporting variables to define model metadata used at export time
- `params.sh`: script exporting variables to define dataset-level parameters, e.g.,
Common Voice release filename, sha256 value, Lingua Libre language
parameters, etc.
- `prepare_lm.sh`: prepare text content for producing external scorer. This
should produce a `sources_lm.txt` file.
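The only contract stated here for `prepare_lm.sh` is that it emits a `sources_lm.txt` file. A hypothetical minimal sketch follows; the input file and the lowercasing step are placeholders, real scripts do much more normalization:

```shell
# Hypothetical prepare_lm.sh sketch: concatenate text sources and
# lowercase them into sources_lm.txt
mkdir -p sources
printf 'Bonjour LE Monde\n' > sources/sample.txt   # placeholder input
cat sources/*.txt | tr '[:upper:]' '[:lower:]' > sources_lm.txt
cat sources_lm.txt
```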

Miscellaneous parameters:
- `use_tf_random_gen_state`: set to 0 if your GPU doesn't need TensorFlow's CuDNN random state generation for training.

Pay attention to automatic mixed precision: it speeds up the training
process (both by itself and because it allows increasing the batch size).
However, it is only advisable while experimenting with hyper-parameters:
the selection of the best-evaluating model seems to vary much more when AMP
is enabled than when it is disabled. So use it with caution when tuning
parameters and disable it when making a release.

Default values should provide a good experience.

The default batch size has been tested with this mix of datasets:
- Common Voice French, released in April 2022 (v9.0)
- TrainingSpeech as of April 11th, 2019
- Lingua Libre as of April 25th, 2020
- OpenSLR 57: African Accented French
- OpenSLR 94: Att-HACK
- M-AILABS French dataset
- MLS French dataset

### Transfer learning from pre-trained checkpoints

To perform transfer learning, please download and prepare a ready-to-use directory
containing the checkpoints to use. Ready-to-use means directly re-usable checkpoint
files, with the proper `checkpoint` descriptor that TensorFlow produces.

To use an existing checkpoint, just ensure the `docker run` invocation includes a mount such as:
`type=bind,src=PATH/TO/CHECKPOINTS,dst=/transfer-checkpoint`. Upon running, the checkpoints will automatically be used as the starting point.

Checkpoints typically don't use automatic mixed precision or fully-connected layer normalization, and mostly use a standard hidden-layer width (2048 unless specified otherwise). So don't change those parameters when fine-tuning from them.
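Putting it together, a run that starts from existing checkpoints might look like this (both `src=` paths are placeholders):

```
$ docker run -it --gpus=all \
    --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt \
    --mount type=bind,src=PATH/TO/CHECKPOINTS,dst=/transfer-checkpoint \
    commonvoice-fr
```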

## Hardware

Training successful on:

> - Threadripper 3950X + 128GB RAM
> - 2x RTX 2080 Ti
> - Debian Sid, kernel 5.7, driver 440.100

> - Threadripper 2920X + 96GB RAM
> - 2x Titan RTX
> - Manjaro (Arch) Linux, kernel 5.15.32-1-MANJARO, driver 510.60.02


With ~1000h of audio, one training epoch takes ~23 min (automatic mixed precision enabled).
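As a back-of-the-envelope estimate, the default 40 epochs at ~23 min each amount to roughly 15 hours of pure training time, excluding data preparation, evaluation, and scorer building:

```shell
# 40 epochs * 23 min per epoch, converted to hours and minutes
echo "$(( 40 * 23 / 60 )) hours $(( 40 * 23 % 60 )) minutes"
```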

## Run the image:

The `mount` option is really important: this is where intermediate files, training data, checkpoints, and final model files will be produced.

```
$ docker run -it --gpus=all --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt --env TRAIN_BATCH_SIZE=64 commonvoice-fr
```

Training parameters can be changed at runtime as well using environment variables.
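For instance, to lower the batch size and shorten training at runtime (the values are illustrative; the variable names match the `ENV` declarations in `Dockerfile.train`):

```
$ docker run -it --gpus=all \
    --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt \
    --env TRAIN_BATCH_SIZE=32 --env EPOCHS=20 --env AMP=1 \
    commonvoice-fr
```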
258 changes: 258 additions & 0 deletions STT/Dockerfile.train
FROM nvcr.io/nvidia/tensorflow:22.02-tf1-py3

ARG stt_repo=coqui-ai/STT
ARG stt_branch=fcec06bdd89f6ae68e2599495e8471da5e5ba45e
ARG stt_sha1=fcec06bdd89f6ae68e2599495e8471da5e5ba45e
ARG cc_repo=mozilla/CorporaCreator
ARG cc_sha1=73622cf8399f8e634aee2f0e76dacc879226e3ac
ARG kenlm_repo=kpu/kenlm
ARG kenlm_branch=87e85e66c99ceff1fab2500a7c60c01da7315eec

# Model parameters
ARG model_language=fr
ENV MODEL_LANGUAGE=$model_language

# Training hyper-parameters
ARG train_batch_size=64
ENV TRAIN_BATCH_SIZE=$train_batch_size

ARG dev_batch_size=64
ENV DEV_BATCH_SIZE=$dev_batch_size

ARG test_batch_size=64
ENV TEST_BATCH_SIZE=$test_batch_size

ARG n_hidden=2048
ENV N_HIDDEN=$n_hidden

ARG epochs=40
ENV EPOCHS=$epochs

ARG learning_rate=0.0001
ENV LEARNING_RATE=$learning_rate

ARG dropout=0.3
ENV DROPOUT=$dropout

ARG lm_top_k=500000
ENV LM_TOP_K=$lm_top_k

ARG lm_alpha=0.0
ENV LM_ALPHA=$lm_alpha

ARG lm_beta=0.0
ENV LM_BETA=$lm_beta

ARG beam_width=500
ENV BEAM_WIDTH=$beam_width

ARG early_stop=1
ENV EARLY_STOP=$early_stop

ARG amp=0
ENV AMP=$amp

# Skipping batch test to avoid hanging processes
# Should be set to 0 by default once STT#2195 is fixed
# See https://github.com/coqui-ai/STT/issues/2195 for more details
ARG skip_batch_test=1
ENV SKIP_BATCH_TEST=$skip_batch_test

# Dataset management
ARG duplicate_sentence_count=1
ENV DUPLICATE_SENTENCE_COUNT=$duplicate_sentence_count

# Should be of the form: lm_alpha_max,lm_beta_max,n_trials
ARG lm_evaluate_range=
ENV LM_EVALUATE_RANGE=$lm_evaluate_range

# Data augmentation
ARG enable_augments=0
ENV ENABLE_AUGMENTS=$enable_augments

ARG augmentation_arguments="augments.txt"
ENV AUGMENTATION_ARGUMENTS=$augmentation_arguments

# Others
ARG english_compatible=0
ENV ENGLISH_COMPATIBLE=$english_compatible

ARG lm_add_excluded_max_sec=0
ENV LM_ADD_EXCLUDED_MAX_SEC=$lm_add_excluded_max_sec

# To fine-tune using your own data
ARG cv_personal_first_url=
ENV CV_PERSONAL_FIRST_URL=$cv_personal_first_url

ARG cv_personal_second_url=
ENV CV_PERSONAL_SECOND_URL=$cv_personal_second_url

ARG log_level=1
ENV LOG_LEVEL=$log_level

ARG uid=999
ENV UID=$uid

ARG gid=999
ENV GID=$gid

# Configure random state
# Required for training on newer GPUs such as the 30/40 series.
# You can safely disable it (set to 0) if your GPU doesn't need it.
ARG use_tf_random_gen_state=1
ENV TF_CUDNN_RESET_RND_GEN_STATE=$use_tf_random_gen_state

# Make sure we can extract filenames with UTF-8 chars
ENV LANG=C.UTF-8

# Avoid keyboard-configuration step
ENV DEBIAN_FRONTEND noninteractive

ENV HOMEDIR /home/trainer

ENV VIRTUAL_ENV_NAME stt-train
ENV VIRTUAL_ENV $HOMEDIR/$VIRTUAL_ENV_NAME
ENV STT_DIR $HOMEDIR/stt
ENV CC_DIR $HOMEDIR/cc

ENV STT_BRANCH=$stt_branch
ENV STT_SHA1=$stt_sha1

ENV PATH="$VIRTUAL_ENV/bin:${HOMEDIR}/tf-venv/bin:$PATH"

RUN env

# Get basic packages
RUN apt-get -qq update && apt-get -qq install -y --no-install-recommends \
build-essential \
curl \
wget \
git \
python3 \
python3-pip \
ca-certificates \
cmake \
libboost-all-dev \
zlib1g-dev \
libbz2-dev \
liblzma-dev \
libmagic-dev \
libopus0 \
libopusfile0 \
libsndfile1 \
libeigen3-dev \
pkg-config \
g++ \
python3-venv \
unzip \
pixz \
sox \
sudo \
libsox-fmt-all \
ffmpeg \
locales locales-all \
xz-utils \
software-properties-common

# For exporting using TFLite
RUN add-apt-repository ppa:deadsnakes/ppa -y

RUN apt-get -qq update && apt-get -qq install -y --no-install-recommends \
python3.7 \
python3.7-venv

RUN groupadd -g $GID trainer && \
adduser --system --uid $UID --group trainer

RUN echo "trainer ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/trainer && \
chmod 0440 /etc/sudoers.d/trainer

# Below that point, nothing requires being root
USER trainer

WORKDIR $HOMEDIR

RUN git clone https://github.com/$kenlm_repo.git ${HOMEDIR}/kenlm && cd ${HOMEDIR}/kenlm && git checkout $kenlm_branch \
&& mkdir -p build \
&& cd build \
&& cmake .. \
&& make -j

WORKDIR $HOMEDIR

RUN python3 -m venv --system-site-packages $VIRTUAL_ENV_NAME

# Venv for upstream tensorflow with tflite api
RUN python3.7 -m venv ${HOME}/tf-venv

ENV PATH=$HOMEDIR/$VIRTUAL_ENV_NAME/bin:$PATH

RUN git clone https://github.com/$stt_repo.git $STT_DIR

WORKDIR $STT_DIR

RUN git checkout $stt_branch

WORKDIR $STT_DIR

RUN pip install --upgrade pip wheel setuptools

# Build CTC decoder first, to avoid clashes on incompatible versions upgrades
RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip install --upgrade native_client/ctcdecode/dist/*.whl

# Install STT
# No need for the decoder since we did it earlier
# TensorFlow GPU should already be installed on the base image,
# and we don't want to break that
RUN DS_NODECODER=y DS_NOTENSORFLOW=y pip install --upgrade --force-reinstall -e .

# Install coqui_stt_training (inside tf-venv) for exporting models using tflite
RUN ${HOME}/tf-venv/bin/pip install -e .

# Pre-built native client tools
RUN LATEST_STABLE_RELEASE=$(curl "https://api.github.com/repos/coqui-ai/STT/releases/latest" | python -c 'import sys; import json; print(json.load(sys.stdin)["tag_name"])') \
bash -c 'curl -L https://github.com/coqui-ai/STT/releases/download/${LATEST_STABLE_RELEASE}/native_client.tflite.Linux.tar.xz | tar -xJvf -' && ls -hal generate_scorer_package

WORKDIR $HOMEDIR

RUN git clone https://github.com/$cc_repo.git $CC_DIR

WORKDIR $CC_DIR

RUN git checkout $cc_sha1

WORKDIR $CC_DIR

# Copy corpora patch
COPY --chown=trainer:trainer corpora.patch $CC_DIR

RUN patch -p1 < corpora.patch

# error: parso 0.7.0 is installed but parso<0.9.0,>=0.8.0 is required by {'jedi'}
# modin has this weird strict but implicit dependency: swifter<1.1.0
RUN pip install parso==0.8.3 'swifter<1.1.0'

RUN pip install modin[all]

RUN python setup.py install

# For CC PMF importer
RUN pip install num2words zipfile38

# Fix numpy and pandas version
RUN python -m pip install 'numpy<1.19.0,>=1.16.0' 'pandas<1.4.0dev0,>=1.0'

# Use yaml in bash to get best lm alpha and beta from opt for export
RUN python -m pip install shyaml

WORKDIR $HOMEDIR

ENV PATH="${HOMEDIR}/kenlm/build/bin/:$PATH"

# Copy now so that docker build can leverage caches
COPY --chown=trainer:trainer . $HOMEDIR/

COPY --chown=trainer:trainer ${MODEL_LANGUAGE}/ $HOMEDIR/${MODEL_LANGUAGE}/

ENTRYPOINT "$HOMEDIR/run.sh"