Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implemented XOR splitting method of train,dev,test #58

Merged
merged 2 commits into from
Jan 29, 2019

Conversation

simnotes
Copy link
Collaborator

@simnotes simnotes commented Jan 25, 2019

Issue #1
I change the split-up-process of data into training-, dev- and test-datasets in a way that no client_id will be in more than one dataset. I'll admit it does not look pretty and performance-wise the current split-mechanism with just sorting the dataset according to speaker-counts and splitting them just by number is much faster, but it get's the job done.

The splitting works in 2 steps:
creating a continuous index for every single client_id in the dataset
looping over the continous index and adding one client_id first to test, then to dev and the rest to train dataset until each dataset is filled up according to former calculated sizes

This way, each dataset contains roughly the prior calculated number of rows to resemble the targeted confidence of 99% with no intersections. Furthermore the train dataset still contains all the "power users" (i.e. all users with a lot of contributions to the dataset) with declining contribution-factor for dev and test datasets (in that order).

I did an analysis of how many intersections there are between datasets with the current method vs the new one presented in this PR: the old one had 3 intersections at most for any given language, most of the time there has only been an intersection between dev and test. You can have a look at this jupyter notebook for reference: https://github.com/simnotes/transcripts/blob/master/notebooks/xor_train_dev_test.ipynb

So maybe XOR'ing is not needed at all and we just can keep the old one.

@kdavis-mozilla kdavis-mozilla self-requested a review January 27, 2019 21:42
@kdavis-mozilla
Copy link
Contributor

I've finally got some time to look at this. So I'll start the review.

I think having XOR'ing is, if not needed, a extremely beneficial attribute for the data sets.

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice addition! I just have one comment (bug?)

When adding to train there's no check that the new length of train is less than or equal to train_size.

It seems like such a check should be required as the length of valid may be larger than train_size + dev_size + test_size.

src/corporacreator/corpus.py Show resolved Hide resolved
Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@simnotes simnotes merged commit b0cbeb8 into common-voice:master Jan 29, 2019
@simnotes simnotes deleted the xordatasets branch January 29, 2019 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants