Add default preprocessor, add check for existence of valid data #102

phirework · 2019-12-10T13:52:19Z

It seems we've added a whole bunch of languages since the last dataset release. Instead of copying the same boilerplate file 20 times and keeping track of new languages, I added a default preprocessor that just returns the sentence unchanged. I also deleted all the language preprocessors that were just the boilerplate; let me know if there was a specific reason these were in separate files. I also amended the documentation to explain this.
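A minimal sketch of the fallback idea described above, not the code from this PR. The package name `preprocessors`, the convention that each locale module exposes a function named after the locale, and the `(client_id, sentence)` signature are all assumptions made for illustration.

```python
import importlib


def default_preprocessor(client_id, sentence):
    """Fallback for locales without a dedicated module: return the sentence unchanged."""
    return sentence


def get_preprocessor(locale, package="preprocessors"):
    """Return the locale-specific preprocessor if one exists, else the default.

    Assumption: a locale module, when present, exposes a function named
    after the locale (for example `preprocessors/en.py` defines `en`).
    """
    try:
        module = importlib.import_module(f"{package}.{locale}")
        return getattr(module, locale)
    except (ImportError, AttributeError):
        # No module for this locale, or no matching function inside it.
        return default_preprocessor
```

With a lookup like this, a new locale needs no file at all until it has language-specific cleanup rules.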
We have 108 clips from the vot locale (Votic), none of which have any votes (because there are 20 living speakers in the world). This was breaking _calculate_data_set_sizes: the loop never runs, so train_size et al. are never assigned. I added a length check to handle this.
I also added an exception handler that skips AttributeError, to deal with the case of a Corpus not having buckets, as described above.
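The two guards above can be sketched as follows. This is a toy illustration under stated assumptions, not the PR's implementation: `calculate_sizes`, the 80/10/10 split, the toy `Corpus` class, and `locales_with_buckets` are all hypothetical stand-ins.

```python
def calculate_sizes(total_size):
    """Hypothetical stand-in for the size calculation; the 80/10/10 split is an assumption."""
    # Initialise first so the names exist even when there is nothing to split.
    train_size = dev_size = test_size = 0
    if total_size > 0:
        dev_size = total_size // 10
        test_size = total_size // 10
        train_size = total_size - dev_size - test_size
    return train_size, dev_size, test_size


class Corpus:
    """Toy corpus: `buckets` is only set when there are validated clips."""

    def __init__(self, locale, validated_clips):
        self.locale = locale
        if validated_clips:
            sizes = calculate_sizes(len(validated_clips))
            self.buckets = dict(zip(("train", "dev", "test"), sizes))


def locales_with_buckets(corpora):
    """Skip corpora that never got buckets instead of crashing on them."""
    kept = []
    for corpus in corpora:
        try:
            corpus.buckets
        except AttributeError:
            # No validated data for this locale, so there is nothing to save.
            continue
        kept.append(corpus.locale)
    return kept
```

Assigning defaults before the conditional work is what removes the unassigned-variable failure; the `AttributeError` handler only covers the downstream case where a corpus was built without buckets.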

I am not a Python dev, so please do let me know if there's a more Pythonic way of handling any of the above.

johngian · 2020-01-09T15:01:06Z

Hi @phirework! Here are some comments on the PR, following your feedback request:

You are deleting some preprocessors (e.g. br.py) completely, but preprocessors/__init__.py still imports them, which might raise an error.
Is there a way I can test it locally? I am not really familiar with the tool (I have only run it once, with the en dataset).
There are some in-line comments. They might be a little nit-picky, but I don't really know the context of the tool. Feel free to ignore them.
The flake8 linter raises a lot of errors. Most of them were not introduced in this PR, but it might be worth checking which ones were introduced in the diff.

src/corporacreator/corpus.py

phirework · 2020-01-20T21:58:07Z

Thanks for the review! Removed unused preprocessors.

There's no easy way to test locally. You could generate a live clips.tsv using the bundler (https://github.com/Common-Voice/common-voice-bundler/) and then run the result through this tool. I can also send you the full clips.tsv file I was working from for the 2019 H2 corpus.

Good call on the lint issues, I'm going to file a separate ticket to clean that up.

phirework added 2 commits December 10, 2019 08:22

Add default preprocessor, add check for existence of valid data

8ff0877

Amend documentation to match changes

e2fe36b

johngian reviewed Jan 9, 2020

src/corporacreator/corpus.py (resolved)

src/corporacreator/corpus.py (outdated, resolved)

Remove unused imports

49b78fe

johngian approved these changes Jan 24, 2020

instantiate empty data frames before splitting

abbcfa9

phirework merged commit fe3e04d into common-voice:master Jun 15, 2020

phirework deleted the jz/default-preprocessor branch June 15, 2020 21:40
