gwellianau profi a CV6 / testing improvements and CV6

techiaith · Jan 25, 2021 · 5ee2465
1 parent 2b17bc3

Showing 13 changed files with 430 additions and 132 deletions.
diff --git a/Dockerfile b/Dockerfile
@@ -9,13 +9,15 @@ RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.d
 	&& apt-get clean \
 	&& git lfs install \
 	&& pip install sox wget sklearn pandas python_speech_features virtualenv \ 
-				   webrtcvad requests tqdm columnize praatio srt \
+				   webrtcvad requests tqdm columnize praatio srt GitPython pydub \
 	&& rm -rf /var/lib/apt/lists/* 
 
+
 ENV LC_ALL cy_GB.UTF-8
 ENV LANG cy_GB.UTF-8
 ENV LANGUAGE cy_GB.UTF-8
 
+
 #
 WORKDIR /DeepSpeech/native_client
 

diff --git a/Makefile b/Makefile
@@ -1,7 +1,7 @@
 default: build
 
 DEEPSPEECH_RELEASE := 0.9.1
-TECHIAITH_RELEASE := 20.12
+TECHIAITH_RELEASE := 21.01
 
 run: 
 	docker run --gpus all --name techiaith-deepspeech-v${DEEPSPEECH_RELEASE}-${USER} -it \

diff --git a/local/README.md b/local/README.md
@@ -9,21 +9,23 @@ Mae'r sgriptiau canlynol yn cysylltu ag yn hwyluso'r holl gamau a ddilynir er mw
 
 ## Prerequisites
 
-Download the Welsh speech data from the CommonVoice website: https://voice.mozilla.org/cy/datasets. It is provided as one large compressed file (e.g. `cy.tar.gz`). Save the file in the `data` folder.
+Download the Welsh speech data from the CommonVoice website: https://voice.mozilla.org/cy/datasets. It is provided as one large compressed file (e.g. `cy.tar.gz`). Save the file in the `data/commonvoice` folder.
+
+Also download the OSCAR corpus from https://oscar-public.huma-num.fr/shuff-orig/cy, which contains Welsh texts from the web. You will need to register before the website allows the download. Save the file in the `data/oscar` folder.
 
 
 ## Preparing Data
 
 ### `import_audio_archive.py`
 
 ```shell
-root@c67722092f2e:/DeepSpeech# bin/bangor_welsh/import_audio_archive.py --archive cy.tar.gz --target_dir /data/commonvoice-cy-v5-20200622/
+root@c67722092f2e:/DeepSpeech# bin/bangor_welsh/import_audio_archive.py --archive /data/commonvoice/cy.tar.gz --target_dir /data/commonvoice/
 ```
 
 ### `analyze_audio.py`
 
 ```shell
-root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/analyze_audio.py --csv_dir /data/commonvoice-cy-v5-20200622/clips/
+root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/analyze_audio.py --csv_dir /data/commonvoice/clips/
 /data/commonvoice-cy-v5-20200622/clips/dev.csv                0.91 hours      (3269.93 seconds)
 /data/commonvoice-cy-v5-20200622/clips/test.csv               0.98 hours      (3514.49 seconds)
 /data/commonvoice-cy-v5-20200622/clips/train.csv              1.09 hours      (3941.04 seconds)
@@ -41,7 +43,7 @@ Defnyddiwch y sgript ganlynol i hyfforddi model acwstig. Dyle paramedr `-a` nodi
 This script uses DeepSpeech's transfer learning feature to benefit from Mozilla's English acoustic models, which were trained on much larger audio datasets, as a starting point for training Welsh speech recognition.
 
 ```shell
-root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/run_tl_cv_cy.sh -a /data/commonvoice-cy-v5-20200622/clips
+root@c67722092f2e:/DeepSpeech# /DeepSpeech/bin/bangor_welsh/run_tl_cv_cy.sh -a /data/commonvoice/clips
 ```
 
 
@@ -53,20 +55,27 @@ Nid yw model acwstig ar ei ben ei hun, er ei fod wedi defnyddio technegau dysgu
 
 Further resources from Bangor University are needed in order to train DeepSpeech for Welsh speech recognition in various useful contexts.
 
-The script below downloads further recordings and text corpora that enable Welsh speech recognition within a digital assistant ('macsen') or a transcriber ('transcribe'), as requested with the `-d` parameter.
+The script below downloads further recordings and text corpora that enable Welsh speech recognition within a digital assistant and a transcriber. You must download the OSCAR text corpus archive file beforehand in order to use it with the command below:
 
 ```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/import_bangor_resources.py -t /data/macsen -d macsen
+root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/import_bangor_resources.py -o /data/oscar/cy.txt.gzip -c /data/commonvoice/validated.tsv
 ```
 
+The import script also filters out any texts that are unsuitable for training speech recognition language models and creates a 'clean' copy (`.clean`) of the corpus.
 
 
 ### `build_lm_scorer.sh`
 
 This is the main script for training a language model and then evaluating it with the acoustic model from the earlier DeepSpeech training steps.
 
+##### For using speech recognition within Macsen:
+```shell
+root@6a88b0d59848:/DeepSpeech# ./bin/bangor_welsh/build_lm_scorer.sh -s /data/bangor/lm-data/macsen/corpus.clean.txt -t /data/bangor/testsets/data/macsen/deepspeech.csv -o /data/bangor/lm/macsen
+```
+
+##### For using speech recognition for transcription:
 ```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/build_lm_scorer.sh -s /data/macsen/corpus.clean.txt -o /data/macsen/ -t /data/macsen/deepspeech.csv
+root@6a88b0d59848:/DeepSpeech# ./bin/bangor_welsh/build_lm_scorer.sh -s /data/bangor/lm-data/oscar/corpus.clean.txt -t /data/bangor/testsets/data/trawsgrifio/deepspeech.csv -o /data/bangor/lm/trawsgrifio
 ```
 
 
@@ -78,5 +87,5 @@ Bydd y sgript yma yn arbrofi gyda gwahanol baramedrau modelau iaith nes iddo ddo
 The process can take a long time (hours, or a day or two) because it runs thousands of trials. At the end, the script reports the two best values found and asks you to enter them so they can be included in the final language model package (see the `kenlm.scorer` file in the directory given by the script's `-l` argument).
 
 ```shell
-root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/optimize_lm_scorer.sh -l /data/mascen -t /data/macsen/deepspeech.csv
+root@6a88b0d59848:/DeepSpeech# bin/bangor_welsh/optimize_lm_scorer.sh -l /data/bangor/lm/macsen -t /data/bangor/testsets/data/macsen/deepspeech.csv
 ```
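
Note on `optimize_lm_scorer.sh`: the README says the script tries thousands of parameter combinations and reports the two best values. The search strategy itself is not part of this diff, so the following is only a minimal sketch of that idea using plain random search. The `toy_error` function is a hypothetical stand-in for the word error rate a real run would measure on the test set, and its minimum is placed arbitrarily at the default alpha and beta from `build_lm_scorer.sh`.

```python
import random


def random_search(objective, n_trials, alpha_max, beta_max, seed=0):
    """Try n_trials random (alpha, beta) pairs and keep the lowest-scoring one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, alpha_max)
        beta = rng.uniform(0.0, beta_max)
        score = objective(alpha, beta)
        if best is None or score < best[0]:
            best = (score, alpha, beta)
    return best


def toy_error(alpha, beta):
    # Hypothetical objective: a real run would decode the test set with the
    # scorer and return the measured error rate instead of this bowl shape.
    return (alpha - 0.75) ** 2 + (beta - 1.85) ** 2


score, alpha, beta = random_search(toy_error, n_trials=2000, alpha_max=5.0, beta_max=5.0)
print("--default_alpha %.2f --default_beta %.2f" % (alpha, beta))
```

The two reported numbers play the role of the values the README asks you to type back in for the final scorer package.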
diff --git a/local/analyze_audio.py b/local/analyze_audio.py
@@ -15,8 +15,17 @@
 def main(csv_root_dir, **args):
     csv_files = pathlib.Path(csv_root_dir).glob("*.csv")
 
-    for csv_file_path in csv_files:        
+    # client_id	path	sentence	up_votes	down_votes	age	gender	accent	locale	segment
+    for csv_file_path in csv_files:
+
         df = pandas.read_csv(csv_file_path, encoding='utf-8')        
+        #
+        df_grouped = df.groupby("transcript").size().to_frame('count').reset_index()
+        df_grouped = df_grouped.sort_values("count", ascending=False)
+
+        df_grouped.to_csv(str(csv_file_path).replace(".csv",".dups.txt"))
+
+        #        
         total_duration = 0.0
         count = 0
         for index, row in df.iterrows():
@@ -25,6 +34,9 @@ def main(csv_root_dir, **args):
             total_duration = total_duration + librosa.get_duration(filename=wav_file_path)
 
         print ("%s\t%s recordings\t\t%.2f hours\t(%.2f seconds)" % (csv_file_path, count, total_duration/60.0/60.0, total_duration))
+        print (df_grouped.nlargest(n=5, columns='count'))
+        print ('\n')
+
 
 
 if __name__ == "__main__": 
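
Note on the new duplicate report: the added `groupby("transcript")` lines count how often each transcript occurs and write the ranking to a `.dups.txt` file. The same counting can be sketched with only the standard library, as below. This swaps `pandas` for `collections.Counter`, and the sample rows and the helper name `count_duplicate_transcripts` are made up for illustration; only the `transcript` column name comes from the diff.

```python
import csv
import io
from collections import Counter


def count_duplicate_transcripts(csv_text, column="transcript"):
    """Return (transcript, count) pairs, most frequent first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    # most_common() sorts by count in descending order
    return counts.most_common()


# Illustrative data only, laid out like a DeepSpeech CSV.
sample = (
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,bore da\n"
    "b.wav,120,nos da\n"
    "c.wav,110,bore da\n"
)

ranked = count_duplicate_transcripts(sample)
print(ranked[:5])  # the five most repeated transcripts, like nlargest(n=5)
```

A heavily repeated transcript at the top of this ranking suggests the split is dominated by a few sentences, which is what the printed top five in the commit makes visible.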

diff --git a/local/build_lm_scorer.sh b/local/build_lm_scorer.sh
@@ -7,6 +7,8 @@ source_text_file=''
 output_dir=''
 test_files=''
 
+VOCAB_SIZE=50000
+
 alphabet_file_path=/DeepSpeech/bin/bangor_welsh/alphabet.txt
 checkpoint_cy_dir=/checkpoints/cy
 
@@ -40,7 +42,7 @@ if [ -z "$output_dir" ]; then
    	exit 2
 fi
 
-
+mkdir -p ${output_dir}
 cd ${output_dir}
 
 set +x
@@ -49,11 +51,11 @@ echo "#### Generating binary language model
 echo "####################################################################################"
 set -x
 python /DeepSpeech/data/lm/generate_lm.py \
-	--input_txt "${source_text_file}" \
+  --input_txt "${source_text_file}" \
   --output_dir . \
-  --top_k 50000 \
+  --top_k ${VOCAB_SIZE} \
   --kenlm_bins '/DeepSpeech/native_client/kenlm/build/bin/' \
-  --arpa_order 5 \
+  --arpa_order 6 \
   --max_arpa_memory '85%' \
   --arpa_prune "0|0|1" \
   --binary_a_bits 255 \
@@ -62,6 +64,7 @@ python /DeepSpeech/data/lm/generate_lm.py \
   --discount_fallback
 
 
+
 set +x
 echo "####################################################################################"
 echo "#### Generating package for un-optimized language model package                 ####"
@@ -81,7 +84,7 @@ set -x
 /DeepSpeech/native_client/generate_scorer_package \
 	--alphabet "${alphabet_file_path}" \
 	--lm lm.binary \
-	--vocab vocab-50000.txt \
+	--vocab vocab-${VOCAB_SIZE}.txt \
 	--package kenlm.scorer \
  	--default_alpha 0.75 \
 	--default_beta 1.85
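
Note on `VOCAB_SIZE`: the change ties `--top_k` and the `--vocab` file name to one variable, which matters if the vocabulary file is named after the top-k value, as the `vocab-${VOCAB_SIZE}.txt` argument implies. The sketch below illustrates that coupling under that assumption. `top_k_vocab` and `vocab_filename` are hypothetical helpers, not functions from `generate_lm.py`, and the three-line corpus is invented.

```python
from collections import Counter

VOCAB_SIZE = 3  # stands in for VOCAB_SIZE=50000 in build_lm_scorer.sh


def top_k_vocab(lines, k):
    """Return the k most frequent whitespace-separated words."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return [word for word, _ in counter.most_common(k)]


def vocab_filename(k):
    # Assumed naming scheme, inferred from the --vocab argument in the diff.
    return "vocab-{}.txt".format(k)


corpus = ["mae hi yn braf", "mae hi yn oer", "mae yn bwrw glaw"]
vocab = top_k_vocab(corpus, VOCAB_SIZE)
print(vocab_filename(VOCAB_SIZE), vocab)
```

With a single constant, changing the vocabulary size cannot leave the packaging step pointing at a stale `vocab-50000.txt`.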

diff --git a/local/evalutate.sh b/local/evalutate.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+set -e
+set -u
+set -o pipefail
+
+testset_dir=''
+scorer_path=''
+
+alphabet_file_path=/DeepSpeech/bin/bangor_welsh/alphabet.txt
+checkpoint_cy_dir=/checkpoints/cy
+
+while getopts ":t:s:" opt; do
+  case $opt in    
+    t)
+        testset_dir=$OPTARG
+        ;;
+	  s)
+		    scorer_path=$OPTARG
+		    ;;    
+    \?) echo "Invalid option -$OPTARG" >&2
+    ;;
+  esac
+done
+shift "$(($OPTIND -1))"
+
+if [ -z "$testset_dir" ]; then
+    echo "-t testset_dir not set (csv file containing speech test set)"
+   	exit 2
+fi
+if [ -z "$scorer_path" ]; then
+    echo "-s scorer_path not set"
+   	exit 2
+fi
+
+
+set +x
+echo "####################################################################################"
+echo "#### evaluating with transcriber testset                											   ###"
+echo "####################################################################################"
+set -x
+
+python -u /DeepSpeech/evaluate.py \
+	--test_files "${testset_dir}/data/trawsgrifio/OpiwHxPPqRI/deepspeech.csv" \
+  --test_batch_size 1 \
+	--alphabet_config_path "${alphabet_file_path}" \
+	--load_checkpoint_dir "${checkpoint_cy_dir}" \
+	--scorer_path "${scorer_path}"
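
Note on the new evaluation script: it checks that both options are set and then calls `evaluate.py` with a test CSV path built from the `-t` directory. A minimal Python sketch of the same argument handling follows. `build_evaluate_command` is a hypothetical helper, and the paths and flags are copied from the script above rather than from any other source.

```python
import posixpath

ALPHABET = "/DeepSpeech/bin/bangor_welsh/alphabet.txt"
CHECKPOINT_DIR = "/checkpoints/cy"


def build_evaluate_command(testset_dir, scorer_path):
    """Build the evaluate.py argument list, rejecting empty options."""
    if not testset_dir:
        raise ValueError("-t testset_dir not set")
    if not scorer_path:
        raise ValueError("-s scorer_path not set")
    # The script hard-codes this sub-path below the test set directory.
    test_csv = posixpath.join(
        testset_dir, "data", "trawsgrifio", "OpiwHxPPqRI", "deepspeech.csv"
    )
    return [
        "python", "-u", "/DeepSpeech/evaluate.py",
        "--test_files", test_csv,
        "--test_batch_size", "1",
        "--alphabet_config_path", ALPHABET,
        "--load_checkpoint_dir", CHECKPOINT_DIR,
        "--scorer_path", scorer_path,
    ]


cmd = build_evaluate_command("/data/bangor/testsets", "/data/bangor/lm/trawsgrifio/kenlm.scorer")
print(" ".join(cmd))
```

Because the test CSV sub-path is fixed, `-t` selects only the root of the test set tree, not an individual CSV file.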