initial script for automating the creation of a controlled testing environment for OOVs #2057
base: main
create_oovs.sh
Outdated
#!/bin/bash
set -e

stag=1
unused?
create_oovs.sh
Outdated
exit 1
fi

step=1
I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
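One way this could look, as a minimal sketch: it assumes the starting step is read from the first positional argument and defaults to 1. The `should_run` helper and the step messages are hypothetical and not part of the PR.

```shell
#!/bin/bash
set -e

# Assumption: the first positional argument selects the step to start from.
# Falls back to 1 so a plain invocation runs everything.
step="${1:-1}"

# Hypothetical helper: succeeds when the step with the given number should run,
# mirroring the script's existing `[ $step -le N ]` checks.
should_run() {
  [ "$step" -le "$1" ]
}

if should_run 1; then echo "Step 1: Preparing Data"; fi
if should_run 2; then echo "Step 2: Generating Language Model"; fi
if should_run 3; then echo "Step 3: Evaluating"; fi
```

Called as `./create_oovs.sh 2`, a script structured this way would skip data preparation and resume at the language model step, which keeps the iterate-on-one-part workflow without editing the file.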
create_oovs.sh
Outdated
echo "Step 1: Preparing Data"
if [ $step -le 1 ]; then

# Extract corpus unique vocabularies
Suggested change:
- # Extract corpus unique vocabularies
+ # Extract corpus vocabulary (unique words)
create_oovs.sh
Outdated
sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
Suggested change:
- # Pick the least frequent 10% vocabularies to represent OOVs
+ # Pick the least frequent 10% words to build OOV set
create_oovs.sh
Outdated
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
Use `wc -l` to communicate intent earlier.
Suggested change:
- oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
+ oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')
Size of OOV set should be a parameter (with default).
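A rough sketch of what a parameterised OOV size might look like. The `OOV_FRACTION` environment variable and the `oov_count` function are hypothetical names chosen for illustration; the 0.1 default matches the 10% currently hardcoded in the script.

```shell
#!/bin/bash
set -e

# Assumption: the OOV share is passed via an environment variable,
# defaulting to the 10% the script uses today.
oov_fraction="${OOV_FRACTION:-0.1}"

# Hypothetical helper: print the OOV set size for a vocabulary file
# ($1, one entry per line) and a fraction ($2).
oov_count() {
  # Reading from stdin keeps the file name out of wc's output,
  # so awk only ever sees the line count.
  wc -l < "$1" | awk -v frac="$2" '{ print int($1 * frac) }'
}

# Demo on a throwaway 20-line vocabulary file.
vocab_file=$(mktemp "${TMPDIR:-/tmp}/vocab.XXXXXX")
seq 1 20 > "$vocab_file"
oov_count "$vocab_file" "$oov_fraction"
rm -f "$vocab_file"
```

Passing the fraction to `awk` with `-v` avoids splicing a shell variable into the awk program text.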
create_oovs.sh
Outdated

# Prepare OOV csv for testing purposes (to assess imporvements on it)
grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
This `sed` command doesn't work on macOS:
sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command
Can we make it portable to BSD sed? This fix worked for me:
Suggested change:
- grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
+ echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv
+ grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv
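An alternative with the same effect, sketched here as a possibility rather than a required change: group the header and the `grep` in one command list so the output file is opened once. The `write_oov_csv` function and the sample data are hypothetical.

```shell
#!/bin/bash
set -e

# Hypothetical helper: write the CSV header followed by the matching rows.
# $1 = pattern file (one word per line), $2 = input CSV, $3 = output CSV.
write_oov_csv() {
  {
    echo "wav_filename,wav_filesize,transcript"
    # grep exits with status 1 when nothing matches; accept that case
    # so `set -e` does not abort, but still fail on real errors (status 2).
    grep -wFf "$1" "$2" || [ $? -eq 1 ]
  } > "$3"
}

# Demo with made-up data in a temporary directory.
work_dir=$(mktemp -d "${TMPDIR:-/tmp}/oov.XXXXXX")
printf 'zebra\n' > "$work_dir/oov_sents"
printf 'a.wav,10,the zebra ran\nb.wav,20,hello world\n' > "$work_dir/data.csv"
write_oov_csv "$work_dir/oov_sents" "$work_dir/data.csv" "$work_dir/oov_corpus.csv"
cat "$work_dir/oov_corpus.csv"
rm -rf "$work_dir"
```

This uses only `echo`, `grep`, and a brace group, so it should behave the same with GNU and BSD userlands.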
create_oovs.sh
Outdated
fi

# Generate LM
echo "Step 2: Generaing Language Model"
Suggested change:
- echo "Step 2: Generaing Language Model"
+ echo "Step 2: Generating Language Model"
create_oovs.sh
Outdated
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv

# Prepare OOV csv for testing purposes (to assess imporvements on it)
Suggested change:
- # Prepare OOV csv for testing purposes (to assess imporvements on it)
+ # Prepare OOV CSV for testing purposes (to assess improvements on it)
create_oovs.sh
Outdated
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
Checkpoint path should be made into a parameter.
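For illustration, a sketch of how the checkpoint path could be taken from the command line with `getopts`. The `-c` option letter, the `parse_args` function, and the placeholder path are assumptions, not something the PR defines. The evaluate command is only echoed here, not executed.

```shell
#!/bin/bash
set -e

# Hypothetical option parser: -c CHECKPOINT_DIR is required.
parse_args() {
  checkpoint_dir=""
  # Reset OPTIND locally so the function can be called more than once.
  local opt OPTIND=1
  while getopts "c:" opt; do
    case "$opt" in
      c) checkpoint_dir="$OPTARG" ;;
      *) return 1 ;;
    esac
  done
  if [ -z "$checkpoint_dir" ]; then
    echo "usage: create_oovs.sh -c CHECKPOINT_DIR" >&2
    return 1
  fi
}

# Demo with a placeholder path; a real script would pass "$@" instead.
parse_args -c /path/to/checkpoint
echo "python -m coqui_stt_training.evaluate --checkpoint_dir $checkpoint_dir"
```

Making the option required, rather than defaulting to a developer's home directory, means the script fails early with a usage message on another machine.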
create_oovs.sh
Outdated
if [ $step -le 3 ]; then
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
The `native_client/kenlm.scorer` should be `kenlm.scorer`, according to the command in the step above, right? And that should probably be changed to `tmp/kenlm.scorer` to keep all the outputs of the script contained to that folder.
No description provided.