This repository has been archived by the owner on Nov 28, 2022. It is now read-only.

"our voices" competition submission by daniel d. hromada #5

Open
wants to merge 11 commits into
base: main
Changes from 9 commits
12 changes: 2 additions & 10 deletions submit/README.md
@@ -1,10 +1,2 @@
# Submission process

In order to submit your code for the competition, you should do the following:

- Fork this repository in GitHub.
- Create a new directory in the subdirectory that corresponds to the category you want to submit in.
- For example, if you are submitting to the *Variant, Dialect or Accent* category, create your new directory under `Variant_Dialect_Accent`.
- Commit and push your code to that subdirectory in your fork.
- Open a pull request to this repository.

# Submission
Although our submission is, in a certain sense, also relevant to the Method and Open categories, you will find it in the Language Variant section.
45 changes: 45 additions & 0 deletions submit/Variant_Accent_Dialect/HighSorbian-band-A/README.md
@@ -0,0 +1,45 @@
# DeepSpeech for HighSorbian (deepspeech-hsb)

In brief: we took a deepspeech-cs checkpoint, originally trained on Czech audio recordings, transfer-learned it to our WesternSlavic `alphabet.txt`, trained it for some more epochs on Slovak-Czech-Slovak data, and finally focused it on HSB with the following process:

* first, 5 epochs of training on the train set (N=808), until the loss on the validation set (N=173) stops decreasing
* then 1 epoch of training on the validation set
* then 1 epoch of training on the combined train + validation set (N=971)
* finally, 1 epoch of training on the train + validation + fragments \* set
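
The staged schedule above can be written down as data and turned into one training command per stage. This is only a sketch: the README does not give the actual commands, so the CSV file names and the checkpoint directory are hypothetical placeholders, and the script name `DeepSpeech.py` and the flags `--alphabet_config_path`, `--checkpoint_dir`, `--train_files` and `--epochs` are assumed from the DeepSpeech training CLI.

```python
# Each stage: (description, training CSVs, epochs).
# All CSV names are hypothetical; the README does not name the files.
STAGES = [
    ("train set, validated on the validation set", ["train.csv"], 5),
    ("validation set only", ["dev.csv"], 1),
    ("train + validation", ["train.csv", "dev.csv"], 1),
    ("train + validation + fragments",
     ["train.csv", "dev.csv", "fragments.csv"], 1),
]


def build_command(train_csvs, epochs, checkpoint_dir="./checkpoints"):
    """Return the argument list for one fine-tuning stage.

    Every stage reuses the same (hypothetical) checkpoint directory, so each
    stage continues from the weights the previous one left behind.
    """
    return [
        "python3", "DeepSpeech.py",
        "--alphabet_config_path", "./alphabet.txt",
        "--checkpoint_dir", checkpoint_dir,
        "--train_files", ",".join(train_csvs),  # comma-separated CSV list
        "--epochs", str(epochs),
    ]


if __name__ == "__main__":
    for description, csvs, epochs in STAGES:
        print(f"# {description}")
        print(" ".join(build_command(csvs, epochs)))
```

Keeping the stages in a list makes it easy to drop the last entry and reproduce the run without fragment enrichment.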

Once this process is over, we get WER: 0.545715, CER: 0.218711, loss: 66.486069 on the 450 recordings of the CommonVoice test sub-dataset. Without fragment enrichment, the results are WER: 0.568155, CER: 0.233091, loss: 69.035698.
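
For readers unfamiliar with the two metrics: WER and CER are edit distances between the reference transcript and the recognized text, counted over words and characters respectively, and divided by the reference length. The sketch below shows the standard per-utterance definition in plain Python. It is an illustration, not the evaluation code that produced the numbers above, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a x c d"))  # one substituted word out of four
    print(cer("abcd", "abxd"))        # one substituted character out of four
```

A CER well below the WER, as reported here, is expected: a single wrong character is enough to make a whole word count as an error.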

# Model, scorer, alphabet
All the files you need to make your DeepSpeech / coqui / [lesen-mikroserver](https://github.com/hromi/lesen-mikroserver) suite start processing HighSorbian are available at <https://github.com/hromi/our-voices-model-competition/releases/tag/v0.0.1> (files prefixed with `hsb-`).

# Read before use
If you want to use this in your system, make sure that you apply the following unambiguous substitutions to make HSB consistent with the WesternSlavic alphabet:
```
replace('ł','v')
replace('ć','ť')
replace('ń','ň')
replace('ź','zz')
```
for inputs into the system (cf. the `hsb_labels.py` code snippet for the `import_cv2.py` CommonVoice-to-DeepSpeech importer):
```
python3 ./bin/import_cv2.py --filter_alphabet ./alphabet.txt --validate_label_locale hsb_labels.py /data/CommonVoice/hsb
```

Conversely, before displaying the outputs of your STT system, you will need to apply the inverse transformations:
```
replace('v','ł')
replace('ť','ć')
replace('ň','ń')
replace('zz','ź')
```

to show Sorbian readers what they expect to see. (For Slovaks and Czechs, the whole thing is more readable in the WesternSlavic alphabet.)
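
The two substitution tables above can be wrapped in a pair of helper functions. The following is a minimal sketch, not part of the released files: the names `to_western_slavic` and `to_hsb` are hypothetical, and the substitution pairs are copied from the lists above.

```python
# (HSB letter, WesternSlavic stand-in), in the order given in this README.
HSB_TO_WS = [("ł", "v"), ("ć", "ť"), ("ń", "ň"), ("ź", "zz")]


def to_western_slavic(text: str) -> str:
    """Forward substitutions, applied to text going into the STT system."""
    for hsb, ws in HSB_TO_WS:
        text = text.replace(hsb, ws)
    return text


def to_hsb(text: str) -> str:
    """Inverse substitutions, applied to STT output before display."""
    for hsb, ws in HSB_TO_WS:
        text = text.replace(ws, hsb)
    return text


if __name__ == "__main__":
    sample = "słowo dźeń"
    encoded = to_western_slavic(sample)
    print(encoded)          # text as the model sees it
    print(to_hsb(encoded))  # text as shown to the reader
```

Note that the inverse mapping rewrites every `v`, `ť`, `ň` and `zz` in the output, so the round trip is only exact for text that did not contain those sequences before the forward substitution.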


# Curious?
Please read <https://github.com/hromi/our-voices-model-competition/tree/main/submit/Variant_Accent_Dialect/SlovakoCzech-band-C> to learn more about why, what and how it all started.


\* We will go into more detail about the fragment method in a related academic paper.


14 changes: 14 additions & 0 deletions submit/Variant_Accent_Dialect/HighSorbian-band-A/hsb_labels.py
@@ -0,0 +1,14 @@
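# from num2words import num2words
import re


def validate_label(label):
    """Normalize an HSB transcript to the WesternSlavic alphabet."""
    label = label.lower()
    # Optional number expansion, currently disabled:
    # try:
    #     label = num2words(label, lang='cz')
    # except:
    #     1
    # Map HSB-specific letters onto their WesternSlavic stand-ins.
    label = label.replace('ł', 'v')
    label = label.replace('ć', 'ť')
    label = label.replace('ń', 'ň')
    label = label.replace('ź', 'zz')
    # Drop every character that is neither a space nor in the alphabet.
    label = re.sub('[^ abcdefghijklmnopqrstuvwxyzáéíóúýôäčďľěňŕšťžř]', '', label)
    return label  # lower case valid labels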