STT 1.4.0 - CommonVoice 12 #168

STT/CONTRIBUTING.md (114 additions, 0 deletions)
@@ -0,0 +1,114 @@
# Dockerfile for producing the French model

## Licensing:

This model is available under the terms of the MPL 2.0 (see `LICENSE.txt`).

## Prerequisites:

* Ensure you have a running setup of [`Docker` working with GPU support](https://docs.docker.com/config/containers/resource_constraints/#gpu)
* Prepare a host directory with enough space for training / producing intermediate data (>=400GB).
* Ensure it is writable by the `trainer` user (uid 999 by default, as defined in the Dockerfile).
* For the Common Voice dataset, please make sure you have downloaded it before running (the download requires an email address).
  Place `cv-corpus-*-fr` inside your host directory, in a `sources/` subdirectory.
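
For example, preparing the host directory could look like this; the `/data/commonvoice-fr` path and the download location are purely illustrative:

```
$ mkdir -p /data/commonvoice-fr/sources
# Make the directory writable by the `trainer` user (uid 999 by default)
$ sudo chown -R 999 /data/commonvoice-fr
# Move the downloaded Common Voice release into the sources/ subdirectory
$ mv ~/Downloads/cv-corpus-*-fr* /data/commonvoice-fr/sources/
```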

## Build the image:

```
$ docker build [--build-arg ARG=val] -f Dockerfile.train -t commonvoice-fr .
```

Several parameters can be customized:
- `stt_repo` to fetch STT from a different repo than upstream
- `stt_branch` to checkout a specific branch / commit
- `stt_sha1` commit to pull from when installing pre-built binaries
- `kenlm_repo`, `kenlm_branch` for the same parameters for KenLM
- `english_compatible` set to 1 if you want the importers to be run in
  "English-compatible mode": this affects behavior such as allowing the English
  alphabet file to be re-used, for example when doing transfer learning from
  English checkpoints.
- `lm_evaluate_range`: if non-empty, this will perform an LM alpha/beta evaluation.
  The parameter is expected to be of the form `lm_alpha_max`,`lm_beta_max`,`n_trials`.
  See upstream `lm_optimizer.py` for details.
- `lm_add_excluded_max_sec` set to 1 adds excluded sentences that were too long to the language model.

Some parameters for the model itself:
- `train_batch_size` to specify the batch size for training dataset
- `dev_batch_size` to specify the batch size for dev dataset
- `test_batch_size` to specify the batch size for test dataset
- `epochs` to specify the number of epochs to run training for
- `learning_rate` to define the learning rate of the network
- `dropout` to define the dropout applied
- `lm_alpha`, `lm_beta` to control language model alpha and beta parameters
- `amp` to enable or disable automatic mixed precision
- `skip_batch_test` to skip (or not) the batch test completely
- `duplicate_sentence_count` to control whether the Common Voice dataset should be
  regenerated with more duplicates allowed, using Corpora Creator.
  **USE WITH CAUTION**
- `enable_augments` to help the model better generalise on noisy data by augmenting the data in various ways.
- `cv_personal_first_url` to download only your own voice instead of the full Common Voice dataset (first URL)
- `cv_personal_second_url` to download only your own voice instead of the full Common Voice dataset (second URL)
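
As an illustration, a build that overrides a few of these arguments could look like this (the values below are only examples, not recommendations):

```
$ docker build \
    --build-arg train_batch_size=32 \
    --build-arg epochs=30 \
    --build-arg amp=1 \
    --build-arg english_compatible=1 \
    -f Dockerfile.train -t commonvoice-fr .
```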

Language-specific files need to be placed under a language directory. Have a look at `fr/` for an example (the expected layout is sketched after this list):
- `importers.sh`: script to run all the importers
- `metadata.sh`: script exporting variables to define model metadata used at export time
- `params.sh`: script exporting variables to define dataset-level parameters, e.g.,
Common Voice release filename, sha256 value, Lingua Libre language
parameters, etc.
- `prepare_lm.sh`: prepare text content for producing external scorer. This
should produce a `sources_lm.txt` file.
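
For a new language, the directory would therefore look roughly like this (only the four scripts from the list above are shown; the actual `fr/` directory may contain additional files):

```
fr/
├── importers.sh    # runs all the importers
├── metadata.sh     # exports model metadata variables used at export time
├── params.sh       # exports dataset-level parameters (release filename, sha256, ...)
└── prepare_lm.sh   # prepares text for the external scorer, must produce sources_lm.txt
```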

Pay attention to automatic mixed precision: it will speed up the training
process (both by itself and because it allows increasing the batch size). However,
this is only viable when you are experimenting with hyper-parameters. The
selection of the best evaluating model seems to vary much more when AMP is enabled
than when it is disabled. So use it with caution when tuning parameters and
disable it when making a release.

Default values should provide a good experience.

The default batch size has been tested with this mix of datasets:
- Common Voice French, released in April 2022 (v9.0)
- TrainingSpeech as of April 11th, 2019
- Lingua Libre as of April 25th, 2020
- OpenSLR 57: African Accented French
- OpenSLR 94: Att-HACK
- M-AILABS French dataset
- MLS French dataset

### Transfer learning from pre-trained checkpoints

To perform transfer learning, please download and prepare a ready-to-use directory
containing the checkpoint to use. Ready-to-use means directly re-usable checkpoint
files, with the proper `checkpoint` descriptor that TensorFlow produces.

To use an existing checkpoint, just ensure the `docker run` command includes a mount such as:
`type=bind,src=PATH/TO/CHECKPOINTS,dst=/transfer-checkpoint`. Upon running, the checkpoints will automatically be used as the starting point.

Checkpoints typically don't use automatic mixed precision nor fully-connected layer normalization, and mostly use a standard number of hidden layers (2048 unless specified otherwise). So don't change those parameters when fine-tuning from them.
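
For example, a run using an existing checkpoint could look like the following; the host paths are illustrative and the other options are described in the "Run the image" section below:

```
$ docker run -it --gpus=all \
    --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt \
    --mount type=bind,src=PATH/TO/CHECKPOINTS,dst=/transfer-checkpoint \
    commonvoice-fr
```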

## Hardware

Training successful on:

> - Threadripper 3950X + 128GB RAM
> - 2x RTX 2080 Ti
> - Debian Sid, kernel 5.7, driver 440.100

> - Threadripper 2920X + 96GB RAM
> - 2x Titan RTX
> - Manjaro (Arch) Linux, kernel 5.15.32-1-MANJARO, driver 510.60.02


With ~1000h of audio, one training epoch takes ~23min (Automatic Mixed Precision enabled)

## Run the image:

The `--mount` option is really important: this is where intermediate files, training data, checkpoints as
well as final model files will be produced.

```
$ docker run -it --gpus=all --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt --env TRAIN_BATCH_SIZE=64 commonvoice-fr
```

Training parameters can also be changed at runtime using environment variables.
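
For instance, to lower the learning rate and enable automatic mixed precision for an experiment (the values are purely illustrative; any of the environment variables defined in `Dockerfile.train` can be overridden this way):

```
$ docker run -it --gpus=all \
    --mount type=bind,src=PATH/TO/HOST/DIRECTORY,dst=/mnt \
    --env TRAIN_BATCH_SIZE=32 \
    --env LEARNING_RATE=0.00005 \
    --env AMP=1 \
    commonvoice-fr
```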
STT/Dockerfile.train (249 additions, 0 deletions)
@@ -0,0 +1,249 @@
FROM nvcr.io/nvidia/tensorflow:22.02-tf1-py3

ARG stt_repo=coqui-ai/STT
ARG stt_branch=fcec06bdd89f6ae68e2599495e8471da5e5ba45e
ARG stt_sha1=fcec06bdd89f6ae68e2599495e8471da5e5ba45e
ARG cc_repo=mozilla/CorporaCreator
ARG cc_sha1=73622cf8399f8e634aee2f0e76dacc879226e3ac
ARG kenlm_repo=kpu/kenlm
ARG kenlm_branch=87e85e66c99ceff1fab2500a7c60c01da7315eec

# Model parameters
ARG model_language=fr
ENV MODEL_LANGUAGE=$model_language

# Training hyper-parameters
ARG train_batch_size=64
ENV TRAIN_BATCH_SIZE=$train_batch_size

ARG dev_batch_size=64
ENV DEV_BATCH_SIZE=$dev_batch_size

ARG test_batch_size=64
ENV TEST_BATCH_SIZE=$test_batch_size

ARG n_hidden=2048
ENV N_HIDDEN=$n_hidden

ARG epochs=40
ENV EPOCHS=$epochs

ARG learning_rate=0.0001
ENV LEARNING_RATE=$learning_rate

ARG dropout=0.3
ENV DROPOUT=$dropout

ARG lm_top_k=500000
ENV LM_TOP_K=$lm_top_k

ARG lm_alpha=0.0
ENV LM_ALPHA=$lm_alpha

ARG lm_beta=0.0
ENV LM_BETA=$lm_beta

ARG beam_width=500
ENV BEAM_WIDTH=$beam_width

ARG early_stop=1
ENV EARLY_STOP=$early_stop

ARG amp=0
ENV AMP=$amp

# Skipping batch test to avoid hanging processes
# Should be set to 0 by default once STT#2195 is fixed
# See https://github.com/coqui-ai/STT/issues/2195 for more details
ARG skip_batch_test=1
ENV SKIP_BATCH_TEST=$skip_batch_test

# Dataset management
ARG duplicate_sentence_count=1
ENV DUPLICATE_SENTENCE_COUNT=$duplicate_sentence_count

# Should be of the form: lm_alpha_max,lm_beta_max,n_trials
ARG lm_evaluate_range=
ENV LM_EVALUATE_RANGE=$lm_evaluate_range

# Data augmentation
ARG enable_augments=0
ENV ENABLE_AUGMENTS=$enable_augments

# Others
ARG english_compatible=0
ENV ENGLISH_COMPATIBLE=$english_compatible

ARG lm_add_excluded_max_sec=0
ENV LM_ADD_EXCLUDED_MAX_SEC=$lm_add_excluded_max_sec

# To fine-tune using your own data
ARG cv_personal_first_url=
ENV CV_PERSONAL_FIRST_URL=$cv_personal_first_url

ARG cv_personal_second_url=
ENV CV_PERSONAL_SECOND_URL=$cv_personal_second_url

ARG log_level=1
ENV LOG_LEVEL=$log_level

ARG uid=999
ENV UID=$uid

ARG gid=999
ENV GID=$gid

# Make sure we can extract filenames with UTF-8 chars
ENV LANG=C.UTF-8

# Avoid keyboard-configuration step
ENV DEBIAN_FRONTEND noninteractive

ENV HOMEDIR /home/trainer

ENV VIRTUAL_ENV_NAME stt-train
ENV VIRTUAL_ENV $HOMEDIR/$VIRTUAL_ENV_NAME
ENV STT_DIR $HOMEDIR/stt
ENV CC_DIR $HOMEDIR/cc

ENV STT_BRANCH=$stt_branch
ENV STT_SHA1=$stt_sha1

ENV PATH="$VIRTUAL_ENV/bin:${HOMEDIR}/tf-venv/bin:$PATH"

RUN env

# Get basic packages
RUN apt-get -qq update && apt-get -qq install -y --no-install-recommends \
build-essential \
curl \
wget \
git \
python3 \
python3-pip \
ca-certificates \
cmake \
libboost-all-dev \
zlib1g-dev \
libbz2-dev \
liblzma-dev \
libmagic-dev \
libopus0 \
libopusfile0 \
libsndfile1 \
libeigen3-dev \
pkg-config \
g++ \
python3-venv \
unzip \
pixz \
sox \
sudo \
libsox-fmt-all \
ffmpeg \
locales locales-all \
xz-utils \
software-properties-common

# For exporting using TFLite
RUN add-apt-repository ppa:deadsnakes/ppa -y

RUN apt-get -qq update && apt-get -qq install -y --no-install-recommends \
python3.7 \
python3.7-venv

RUN groupadd -g $GID trainer && \
adduser --system --uid $UID --group trainer

RUN echo "trainer ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/trainer && \
chmod 0440 /etc/sudoers.d/trainer

# Below that point, nothing requires being root
USER trainer

WORKDIR $HOMEDIR

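# Build KenLM from source; its binaries are added to PATH further below, for language model / scorer generation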
RUN git clone https://github.com/$kenlm_repo.git ${HOMEDIR}/kenlm && cd ${HOMEDIR}/kenlm && git checkout $kenlm_branch \
&& mkdir -p build \
&& cd build \
&& cmake .. \
&& make -j

WORKDIR $HOMEDIR

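# Main training virtualenv; --system-site-packages keeps the base image's GPU TensorFlow available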
RUN python3 -m venv --system-site-packages $VIRTUAL_ENV_NAME

# Venv for upstream tensorflow with tflite api
RUN python3.7 -m venv ${HOME}/tf-venv

ENV PATH=$HOMEDIR/$VIRTUAL_ENV_NAME/bin:$PATH

RUN git clone https://github.com/$stt_repo.git $STT_DIR

WORKDIR $STT_DIR

RUN git checkout $stt_branch

WORKDIR $STT_DIR

RUN pip install --upgrade pip wheel setuptools

# Build CTC decoder first, to avoid clashes on incompatible versions upgrades
RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip install --upgrade native_client/ctcdecode/dist/*.whl

# Install STT
# No need for the decoder since we did it earlier
# TensorFlow GPU should already be installed on the base image,
# and we don't want to break that
RUN DS_NODECODER=y DS_NOTENSORFLOW=y pip install --upgrade --force-reinstall -e .

# Install coqui_stt_training (inside tf-venv) for exporting models using tflite
RUN ${HOME}/tf-venv/bin/pip install -e .

# Pre-built native client tools
RUN LATEST_STABLE_RELEASE=$(curl "https://api.github.com/repos/coqui-ai/STT/releases/latest" | python -c 'import sys; import json; print(json.load(sys.stdin)["tag_name"])') \
bash -c 'curl -L https://github.com/coqui-ai/STT/releases/download/${LATEST_STABLE_RELEASE}/native_client.tflite.Linux.tar.xz | tar -xJvf -' && ls -hal generate_scorer_package

WORKDIR $HOMEDIR

RUN git clone https://github.com/$cc_repo.git $CC_DIR

WORKDIR $CC_DIR

RUN git checkout $cc_sha1

WORKDIR $CC_DIR

# Copy corpora patch
COPY --chown=trainer:trainer corpora.patch $CC_DIR

RUN patch -p1 < corpora.patch

# error: parso 0.7.0 is installed but parso<0.9.0,>=0.8.0 is required by {'jedi'}
# modin has this weird strict but implicit dependency: swifter<1.1.0
RUN pip install parso==0.8.3 'swifter<1.1.0'

RUN pip install modin[all]

RUN python setup.py install

# For CC PMF importer
RUN pip install num2words zipfile38

# Fix numpy and pandas version
RUN python -m pip install 'numpy<1.19.0,>=1.16.0' 'pandas<1.4.0dev0,>=1.0'

# Use yaml in bash to get best lm alpha and beta from opt for export
RUN python -m pip install shyaml

WORKDIR $HOMEDIR

ENV PATH="${HOMEDIR}/kenlm/build/bin/:$PATH"

# Copy now so that docker build can leverage caches
COPY --chown=trainer:trainer . $HOMEDIR/

COPY --chown=trainer:trainer ${MODEL_LANGUAGE}/ $HOMEDIR/${MODEL_LANGUAGE}/

ENTRYPOINT "$HOMEDIR/run.sh"