gwellianau profi a CV6 / testing improvements and CV6
DewiBrynJones committed Jan 25, 2021
1 parent 2b17bc3 commit 5ee2465
Showing 13 changed files with 430 additions and 132 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -9,13 +9,15 @@ RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.d
&& apt-get clean \
&& git lfs install \
&& pip install sox wget sklearn pandas python_speech_features virtualenv \
-    webrtcvad requests tqdm columnize praatio srt \
+    webrtcvad requests tqdm columnize praatio srt GitPython pydub \
&& rm -rf /var/lib/apt/lists/*


ENV LC_ALL cy_GB.UTF-8
ENV LANG cy_GB.UTF-8
ENV LANGUAGE cy_GB.UTF-8


#
WORKDIR /DeepSpeech/native_client

2 changes: 1 addition & 1 deletion Makefile
@@ -1,7 +1,7 @@
default: build

DEEPSPEECH_RELEASE := 0.9.1
-TECHIAITH_RELEASE := 20.12
+TECHIAITH_RELEASE := 21.01

run:
docker run --gpus all --name techiaith-deepspeech-v${DEEPSPEECH_RELEASE}-${USER} -it \
25 changes: 17 additions & 8 deletions local/README.md
@@ -9,21 +9,23 @@ Mae'r sgriptiau canlynol yn cysylltu ag yn hwyluso'r holl gamau a ddilynir er mw

## Prerequisites

-Download the Welsh speech data from the CommonVoice website (https://voice.mozilla.org/cy/datasets), where it is provided as a single large compressed file (e.g. `cy.tar.gz`). Save the file in the `data` folder.
+Download the Welsh speech data from the CommonVoice website (https://voice.mozilla.org/cy/datasets), where it is provided as a single large compressed file (e.g. `cy.tar.gz`). Save the file in the `data/commonvoice` folder.

Also download the OSCAR corpus from https://oscar-public.huma-num.fr/shuff-orig/cy which contains Welsh text from the web. You need to register before the site will allow the download. Save the file in the `data/oscar` folder.


## Preparing Data

### `import_audio_archive.py`

```shell
-root@c67722092f2e:/DeepSpeech# bin/bangor_welsh/import_audio_archive.py --archive cy.tar.gz --target_dir /data/commonvoice-cy-v5-20200622/
+root@c67722092f2e:/DeepSpeech# bin/bangor_welsh/import_audio_archive.py --archive /data/commonvoice/cy.tar.gz --target_dir /data/commonvoice/
```

### `analyze_audio.py`

```shell
-root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/analyze_audio.py --csv_dir /data/commonvoice-cy-v5-20200622/clips/
+root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/analyze_audio.py --csv_dir /data/commonvoice/clips/
/data/commonvoice-cy-v5-20200622/clips/dev.csv 0.91 hours (3269.93 seconds)
/data/commonvoice-cy-v5-20200622/clips/test.csv 0.98 hours (3514.49 seconds)
/data/commonvoice-cy-v5-20200622/clips/train.csv 1.09 hours (3941.04 seconds)
@@ -41,7 +43,7 @@ Defnyddiwch y sgript ganlynol i hyfforddi model acwstig. Dyle paramedr `-a` nodi
This script uses DeepSpeech's transfer learning feature to take advantage of Mozilla's English acoustic models, which were trained on much larger audio datasets, as a starting point for training Welsh speech recognition.

```shell
-root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/run_tl_cv_cy.sh -a /data/commonvoice-cy-v5-20200622/clips
+root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/run_tl_cv_cy.sh -a /data/commonvoice/clips
```


@@ -53,20 +55,27 @@ Nid yw model acwstig ar ei ben ei hun, er ei fod wedi defnyddio technegau dysgu

Further resources from Bangor University are needed to train DeepSpeech for Welsh speech recognition in a range of useful contexts.

-The script below downloads further recordings and text corpora that enable Welsh speech recognition within a digital assistant ('macsen') or a transcriber ('transcribe'), as selected with the `-d` parameter.
+The script below downloads further recordings and text corpora that enable Welsh speech recognition within a digital assistant and a transcriber. You must download the OSCAR text corpus archive file beforehand so that it can be used with the command below:

```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/import_bangor_resources.py -t /data/macsen -d macsen
+root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/import_bangor_resources.py -o /data/oscar/cy.txt.gzip -c /data/commonvoice/validated.tsv
```

The import script also filters out any text that is unsuitable for training speech recognition language models and creates a 'clean' (`.clean`) copy of the corpus.
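
As a rough illustration of the kind of filtering described above, the sketch below keeps only lines whose characters all belong to an allowed alphabet. The helper name `clean_corpus_lines`, the lowercasing step, and the alphabet contents are assumptions made for this example; the actual rules applied by `import_bangor_resources.py` are not shown in this commit.

```python
def clean_corpus_lines(lines, alphabet):
    """Yield normalised lines made up only of characters in `alphabet`."""
    allowed = set(alphabet)
    for line in lines:
        # Lowercase and collapse runs of whitespace (assumed normalisation).
        text = " ".join(line.lower().split())
        if text and all(ch in allowed for ch in text):
            yield text


# Hypothetical alphabet: some Welsh letters plus apostrophe and space.
ALPHABET = "abcdefghijlmnoprstuwyâêîôûŵŷ' "

sample = ["Mae hi'n  braf heddiw", "Pris: £5.99", "", "Diolch yn fawr"]
cleaned = list(clean_corpus_lines(sample, ALPHABET))
```

In this sample the line containing digits and punctuation is dropped, as is the empty line, while the other two lines are kept in lowercase.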


### `build_lm_scorer.sh`

This is the main script for training a language model and then evaluating it with an acoustic model from the earlier DeepSpeech training steps.

##### For using speech recognition within Macsen:
```shell
root@6a88b0d59848:/DeepSpeech# ./bin/bangor_welsh/build_lm_scorer.sh -s /data/bangor/lm-data/macsen/corpus.clean.txt -t /data/bangor/testsets/data/macsen/deepspeech.csv -o /data/bangor/lm/macsen
```

##### For using speech recognition for transcription:
```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/build_lm_scorer.sh -s /data/macsen/corpus.clean.txt -o /data/macsen/ -t /data/macsen/deepspeech.csv
+root@6a88b0d59848:/DeepSpeech# ./bin/bangor_welsh/build_lm_scorer.sh -s /data/bangor/lm-data/oscar/corpus.clean.txt -t /data/bangor/testsets/data/trawsgrifio/deepspeech.csv -o /data/bangor/lm/trawsgrifio
```


@@ -78,5 +87,5 @@ Bydd y sgript yma yn arbrofi gyda gwahanol baramedrau modelau iaith nes iddo ddo
The process can take a long time (hours, or even a day or two) because it runs thousands of trials. At the end, the script reports the two best values it found and asks you to enter them so that they are included in the final language model package (see the `kenlm.scorer` file in the directory given by the `-l` script argument).

```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/optimize_lm_scorer.sh -l /data/mascen -t /data/macsen/deepspeech.csv
+root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/optimize_lm_scorer.sh -l /data/bangor/lm/macsen -t /data/bangor/testsets/data/macsen/deepspeech.csv
```
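
The search strategy used by `optimize_lm_scorer.sh` is not shown in this commit, so the following is only a sketch of the general idea of trying many parameter pairs and keeping the best one, written here as a plain random search. The names `random_search` and `toy_error`, the search ranges, and the stand-in objective are all assumptions; the toy optimum simply reuses the `--default_alpha 0.75` and `--default_beta 1.85` values from `build_lm_scorer.sh` for illustration.

```python
import random


def random_search(objective, n_trials, seed=0):
    """Try random (alpha, beta) pairs and return the pair with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, 5.0)  # assumed search range
        beta = rng.uniform(0.0, 5.0)   # assumed search range
        score = objective(alpha, beta)
        if score < best_score:
            best_params, best_score = (alpha, beta), score
    return best_params, best_score


def toy_error(alpha, beta):
    # Stand-in for an error rate measured on a test set; lowest at (0.75, 1.85).
    return (alpha - 0.75) ** 2 + (beta - 1.85) ** 2


(best_alpha, best_beta), best_score = random_search(toy_error, n_trials=2000)
```

In a real run each trial would involve decoding the test set with the candidate values, which is why the README warns that the process can take hours or days.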
14 changes: 13 additions & 1 deletion local/analyze_audio.py
@@ -15,8 +15,17 @@
def main(csv_root_dir, **args):
    csv_files = pathlib.Path(csv_root_dir).glob("*.csv")

-    for csv_file_path in csv_files:
    # client_id path sentence up_votes down_votes age gender accent locale segment
+    for csv_file_path in csv_files:

        df = pandas.read_csv(csv_file_path, encoding='utf-8')
        #
        df_grouped = df.groupby("transcript").size().to_frame('count').reset_index()
        df_grouped = df_grouped.sort_values("count", ascending=False)

        df_grouped.to_csv(str(csv_file_path).replace(".csv",".dups.txt"))

        #
        total_duration = 0.0
        count = 0
        for index, row in df.iterrows():
@@ -25,6 +34,9 @@ def main(csv_root_dir, **args):
            total_duration = total_duration + librosa.get_duration(filename=wav_file_path)

        print ("%s\t%s recordings\t\t%.2f hours\t(%.2f seconds)" % (csv_file_path, count, total_duration/60.0/60.0, total_duration))
        print (df_grouped.nlargest(n=5, columns='count'))
        print ('\n')



if __name__ == "__main__":
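
The duplicate-transcript report added to `analyze_audio.py` relies on pandas `groupby`. The same counting idea can be sketched with the standard library alone, assuming a CSV with a `transcript` column as in the script. The helper name `top_duplicates` and the sample rows are invented for this example.

```python
import csv
import io
from collections import Counter


def top_duplicates(csv_text, n=5):
    """Return the n most frequent transcripts as (transcript, count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["transcript"] for row in reader)
    return counts.most_common(n)


# Invented sample data with one repeated transcript.
sample_csv = (
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,bore da\n"
    "b.wav,120,nos da\n"
    "c.wav,110,bore da\n"
)
top = top_duplicates(sample_csv)
```

A heavily repeated sentence at the top of this list is a hint that a split is dominated by a few prompts, which is what the `.dups.txt` files written by the script make visible.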
13 changes: 8 additions & 5 deletions local/build_lm_scorer.sh
@@ -7,6 +7,8 @@ source_text_file=''
output_dir=''
test_files=''

VOCAB_SIZE=50000

alphabet_file_path=/DeepSpeech/bin/bangor_welsh/alphabet.txt
checkpoint_cy_dir=/checkpoints/cy

@@ -40,7 +42,7 @@ if [ -z "$output_dir" ]; then
exit 2
fi


mkdir -p ${output_dir}
cd ${output_dir}

set +x
@@ -49,11 +51,11 @@ echo "#### Generating binary language model
echo "####################################################################################"
set -x
python /DeepSpeech/data/lm/generate_lm.py \
-    --input_txt "${source_text_file}" \
+    --input_txt "${source_text_file}" \
    --output_dir . \
-    --top_k 50000 \
+    --top_k ${VOCAB_SIZE} \
    --kenlm_bins '/DeepSpeech/native_client/kenlm/build/bin/' \
-    --arpa_order 5 \
+    --arpa_order 6 \
    --max_arpa_memory '85%' \
    --arpa_prune "0|0|1" \
    --binary_a_bits 255 \
@@ -62,6 +64,7 @@ python /DeepSpeech/data/lm/generate_lm.py \
    --discount_fallback



set +x
echo "####################################################################################"
echo "#### Generating package for un-optimized language model package ####"
@@ -81,7 +84,7 @@ set -x
/DeepSpeech/native_client/generate_scorer_package \
    --alphabet "${alphabet_file_path}" \
    --lm lm.binary \
-    --vocab vocab-50000.txt \
+    --vocab vocab-${VOCAB_SIZE}.txt \
    --package kenlm.scorer \
    --default_alpha 0.75 \
    --default_beta 1.85
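
`--top_k ${VOCAB_SIZE}` limits the language model vocabulary to the most frequent words in the corpus, and the vocabulary file name (`vocab-${VOCAB_SIZE}.txt`) follows the same number, which is why the commit moves both into one variable. A minimal sketch of the selection idea, assuming simple whitespace tokenisation; `top_k_vocab` is a hypothetical helper and not the code in `generate_lm.py`.

```python
from collections import Counter


def top_k_vocab(lines, k):
    """Return the k most frequent whitespace-separated words, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [word for word, _ in counts.most_common(k)]


# Invented three-line corpus.
corpus = ["bore da", "nos da", "bore da i chi"]
vocab = top_k_vocab(corpus, k=2)
```

With `VOCAB_SIZE` defined once, changing the vocabulary size no longer risks the scorer packaging step pointing at a stale `vocab-50000.txt`.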
47 changes: 47 additions & 0 deletions local/evalutate.sh
@@ -0,0 +1,47 @@
#!/bin/bash
set -e
set -u
set -o pipefail

testset_dir=''
scorer_path=''

alphabet_file_path=/DeepSpeech/bin/bangor_welsh/alphabet.txt
checkpoint_cy_dir=/checkpoints/cy

while getopts ":t:s:" opt; do
    case $opt in
        t)
            testset_dir=$OPTARG
            ;;
        s)
            scorer_path=$OPTARG
            ;;
        \?) echo "Invalid option -$OPTARG" >&2
            ;;
    esac
done
shift "$(($OPTIND -1))"

if [ -z "$testset_dir" ]; then
    echo "-t testset_dir not set (csv file containing speech test set)"
    exit 2
fi
if [ -z "$scorer_path" ]; then
    echo "-s scorer_path not set"
    exit 2
fi


set +x
echo "####################################################################################"
echo "#### evaluating with transcriber testset ###"
echo "####################################################################################"
set -x

python -u /DeepSpeech/evaluate.py \
    --test_files "${testset_dir}/data/trawsgrifio/OpiwHxPPqRI/deepspeech.csv" \
    --test_batch_size 1 \
    --alphabet_config_path "${alphabet_file_path}" \
    --load_checkpoint_dir "${checkpoint_cy_dir}" \
    --scorer_path "${scorer_path}"
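
Note that although the `-t` error message mentions a CSV file, the script treats `-t` as a directory and appends a fixed sub-path to it. The sketch below mirrors that option handling and command assembly in Python to make the relationship explicit. The function name `build_evaluate_command` and the example paths are assumptions; the command is only built as a list and is not executed.

```python
import argparse
import posixpath

ALPHABET_PATH = "/DeepSpeech/bin/bangor_welsh/alphabet.txt"
CHECKPOINT_DIR = "/checkpoints/cy"
# Relative path that the script appends to the -t directory.
TEST_CSV = "data/trawsgrifio/OpiwHxPPqRI/deepspeech.csv"


def build_evaluate_command(argv):
    """Parse -t/-s and return the evaluate.py command as an argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", dest="testset_dir", required=True)
    parser.add_argument("-s", dest="scorer_path", required=True)
    args = parser.parse_args(argv)
    return [
        "python", "-u", "/DeepSpeech/evaluate.py",
        "--test_files", posixpath.join(args.testset_dir, TEST_CSV),
        "--test_batch_size", "1",
        "--alphabet_config_path", ALPHABET_PATH,
        "--load_checkpoint_dir", CHECKPOINT_DIR,
        "--scorer_path", args.scorer_path,
    ]


# Example paths are illustrative only.
command = build_evaluate_command(
    ["-t", "/data/bangor/testsets", "-s", "/data/bangor/lm/trawsgrifio/kenlm.scorer"]
)
```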