implemented XOR splitting method of train,dev,test #58

simnotes · 2019-01-25T16:10:06Z

Issue #1
I change the split-up-process of data into training-, dev- and test-datasets in a way that no client_id will be in more than one dataset. I'll admit it does not look pretty and performance-wise the current split-mechanism with just sorting the dataset according to speaker-counts and splitting them just by number is much faster, but it get's the job done.

The splitting works in 2 steps:
creating a continuous index for every single client_id in the dataset
looping over the continous index and adding one client_id first to test, then to dev and the rest to train dataset until each dataset is filled up according to former calculated sizes

This way, each dataset contains roughly the prior calculated number of rows to resemble the targeted confidence of 99% with no intersections. Furthermore the train dataset still contains all the "power users" (i.e. all users with a lot of contributions to the dataset) with declining contribution-factor for dev and test datasets (in that order).

I did an analysis of how many intersections there are between datasets with the current method vs the new one presented in this PR: the old one had 3 intersections at most for any given language, most of the time there has only been an intersection between dev and test. You can have a look at this jupyter notebook for reference: https://github.com/simnotes/transcripts/blob/master/notebooks/xor_train_dev_test.ipynb

So maybe XOR'ing is not needed at all and we just can keep the old one.

kdavis-mozilla · 2019-01-29T12:40:08Z

I've finally got some time to look at this. So I'll start the review.

I think having XOR'ing is, if not needed, a extremely beneficial attribute for the data sets.

kdavis-mozilla

Really nice addition! I just have one comment (bug?)

When adding to train there's no check that the new length of train is less than or equal to train_size.

It seems like such a check should be required as the length of valid may be larger than train_size + dev_size + test_size.

src/corporacreator/corpus.py

kdavis-mozilla

LGTM

implemented XOR splitting method of train,dev,test

c324418

kdavis-mozilla self-requested a review January 27, 2019 21:42

kdavis-mozilla suggested changes Jan 29, 2019

View reviewed changes

src/corporacreator/corpus.py Show resolved Hide resolved

added restriction to length of train dataset size

69af6a8

kdavis-mozilla approved these changes Jan 29, 2019

View reviewed changes

simnotes merged commit b0cbeb8 into common-voice:master Jan 29, 2019

simnotes deleted the xordatasets branch January 29, 2019 15:04

simnotes mentioned this pull request Jan 29, 2019

Insure a user is in train, dev, xor test #1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implemented XOR splitting method of train,dev,test #58

implemented XOR splitting method of train,dev,test #58

simnotes commented Jan 25, 2019 •

edited

Loading

kdavis-mozilla commented Jan 29, 2019

kdavis-mozilla left a comment

kdavis-mozilla left a comment

implemented XOR splitting method of train,dev,test #58

implemented XOR splitting method of train,dev,test #58

Conversation

simnotes commented Jan 25, 2019 • edited Loading

kdavis-mozilla commented Jan 29, 2019

kdavis-mozilla left a comment

Choose a reason for hiding this comment

kdavis-mozilla left a comment

Choose a reason for hiding this comment

simnotes commented Jan 25, 2019 •

edited

Loading