common-voice · hromi · Oct 17, 2022
diff --git a/submit/README.md b/submit/README.md
@@ -1,10 +1,2 @@
-# Submission process
-
-In order to submit your code for the competition, you should do the following:
-
-- Fork this repository in GitHub.
-- Create a new directory in the subdirectory that corresponds to the category you want to submit in.
-  - For example if you are submitting to the *Variant, Dialect or Accent* category, create your new directory under `Variant_Dialect_Accent`. 
-- Commit and push your code to that subdirectory in your fork.
-- Open a pull request to this repository.
-
+# Submission
+Although our submission is in a certain sense also relevant to the Method and Open categories, you will find it in the Language Variant section.
diff --git a/submit/Variant_Accent_Dialect/HighSorbian-band-A/README.md b/submit/Variant_Accent_Dialect/HighSorbian-band-A/README.md
@@ -0,0 +1,45 @@
+# DeepSpeech for HighSorbian (deepspeech-hsb)
+
+In brief: we took a deepspeech-cs checkpoint, originally trained on Czech audio recordings, transfer-learned it to our WesternSlavic alphabet.txt, trained it for some more epochs on Slovak-Czech-Slovak data, and finally focused it on HSB with the following process:
+
+* first, 5 epochs of training on the train set (N=808), until the loss on the validation set (N=173) stops decreasing
+* then 1 epoch of training on the validation set
+* then 1 epoch of training on the combined train + validation set (N=971)
+* finally, 1 epoch of training on the train + validation + fragments \* set
+
+Once this process is over, we get WER: 0.545715, CER: 0.218711, loss: 66.486069 on the 450 recordings of the CommonVoice test subset. Without fragment enrichment, the results are WER: 0.568155, CER: 0.233091, loss: 69.035698.
+
+# Model, scorer, alphabet
+All the files you need to make your DeepSpeech / Coqui / [lesen-mikroserver](https://github.com/hromi/lesen-mikroserver) suite start processing HighSorbian are available at <https://github.com/hromi/our-voices-model-competition/releases/tag/v0.0.1> (files prefixed with hsb-).
+
+# Read before use
+If you want to use this in your system, make sure that you apply the following unambiguous substitutions to make HSB text consistent with the WesternSlavic alphabet:
+`
+replace('ł','v')
+replace('ć','ť')
+replace('ń','ň')
+replace('ź','zz')
+`
+for inputs into the system (cf. the hsb_labels.py code snippet for the import_cv2.py CommonVoice-to-DeepSpeech importer):
+`
+python3 ./bin/import_cv2.py --filter_alphabet ./alphabet.txt --validate_label_locale hsb_labels.py /data/CommonVoice/hsb
+`
+
+Conversely, before displaying the outputs of your STT system, you will need to apply the inverse transformations:
+`
+replace('v','ł')
+replace('ť','ć')
+replace('ň','ń')
+replace('zz','ź')
+`
+
+to show Sorbian readers what they expect to see. (For Slovaks and Czechs, the output is more readable in the WesternSlavic alphabet.)
+
+
+# Curious?
+Please read <https://github.com/hromi/our-voices-model-competition/tree/main/submit/Variant_Accent_Dialect/SlovakoCzech-band-C> to learn more about why, what and how it all started.
+
+
+\* We will go into more detail about the fragment method in a related academic paper.
+
+
diff --git a/submit/Variant_Accent_Dialect/HighSorbian-band-A/hsb_labels.py b/submit/Variant_Accent_Dialect/HighSorbian-band-A/hsb_labels.py
@@ -0,0 +1,14 @@
+#from num2words import num2words
+import re
+def validate_label(label):
+    label = label.lower()
+    #try:
+    #    label=num2words(label, lang='cz')
+    #except:
+    #    1
+    label = label.replace('ł', 'v')  # map HSB-specific letters onto the WesternSlavic alphabet
+    label = label.replace('ć', 'ť')
+    label = label.replace('ń', 'ň')
+    label = label.replace('ź', 'zz')
+    label = re.sub('[^ abcdefghijklmnopqrstuvwxyzáéíóúýôäčďľěňŕšťžř]', '', label)  # drop every other character
+    return label  # lower-case label restricted to the WesternSlavic alphabet
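
The substitutions listed in the HighSorbian README and applied in `hsb_labels.py` can be sketched as a round trip: one function for text going into the STT system, one for text coming out of it. This is a minimal sketch, not part of the submission; the helper names `to_western_slavic` and `to_hsb` are hypothetical, and only the four substitution pairs come from the README.

```python
# Substitution pairs taken from the README; the function names are hypothetical.
HSB_TO_WS = [("ł", "v"), ("ć", "ť"), ("ń", "ň"), ("ź", "zz")]


def to_western_slavic(text: str) -> str:
    """Normalize HSB text before it enters the STT system."""
    text = text.lower()
    for hsb, ws in HSB_TO_WS:
        text = text.replace(hsb, ws)
    return text


def to_hsb(text: str) -> str:
    """Map STT output back to HSB orthography for display."""
    for hsb, ws in HSB_TO_WS:
        text = text.replace(ws, hsb)
    return text


if __name__ == "__main__":
    encoded = to_western_slavic("Dźeń")
    print(encoded, to_hsb(encoded))
```

The round trip is exact only when the original text contains none of the target symbols: a `v`, `ť`, `ň` or `zz` already present in the input (for example in a loanword) would also be rewritten by `to_hsb`.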