Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default preprocessor, add check for existence of valid data #102

Merged
merged 4 commits into from
Jun 15, 2020
Merged

Add default preprocessor, add check for existence of valid data #102

merged 4 commits into from
Jun 15, 2020

Conversation

phirework
Copy link
Contributor

@phirework phirework commented Dec 10, 2019

  • Seems like we've added a whole bunch of languages since the last dataset release and instead of copying the same boilerplate file 20 times and keeping track of new languages I added a default preprocessor that just spits back the sentence. Also deleted all the language preprocessors that were just the boilerplate - let me know if there was a specific reason these were in separate files. Also amended documentation to explain this.
  • We have 108 clips from vot locale (Votic) where none of it has any votes (because there are 20 living speakers in the world), which was breaking _calculate_data_set_sizes since the loop never runs and train_size et al are never instantiated, so I added a length check.
  • Added an exception handler to skip Attribute errors to deal with the case of a Corpus not having buckets as per above.

I am not a python dev, so please do let me know if there's a more pythonic way of handling any of the above.

@johngian
Copy link

johngian commented Jan 9, 2020

Hi @phirework ! Here are some comments for the PR after your feedback request:

  • You are deleting some preprocessors (eg. br.py) completely but under preprocessors/__init__.py we still import them, which might raise an error.
  • Is there a way I can test it locally? I am not really familiar with the tool (I only run it once with en dataset).
  • There are some comments in-line. Might be a little nit-picky but I don't really know the context of the tool. Feel free to omit.
  • flake8 linter raises a lot of errors. Most of them were not introduced in this PR but it might worth it to check what was introduced in the diff.

src/corporacreator/corpus.py Show resolved Hide resolved
src/corporacreator/corpus.py Outdated Show resolved Hide resolved
@phirework
Copy link
Contributor Author

Thanks for the review! Removed unused preprocessors.

There's no easy way to test locally, you could generate a live clips.tsv using the bundler (https://github.com/Common-Voice/common-voice-bundler/) and then running the result of that through this. I can also send you the full clips.tsv file I was working off of for the 2019 H2 corpus.

Good call on the lint issues, I'm going to file a separate ticket to clean that up.

@phirework phirework merged commit fe3e04d into common-voice:master Jun 15, 2020
@phirework phirework deleted the jz/default-preprocessor branch June 15, 2020 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants