Implemented XOR splitting method for train, dev, test #58
Issue #1
I changed the process that splits the data into training, dev and test datasets so that no client_id appears in more than one dataset. I'll admit it does not look pretty, and performance-wise the current split mechanism, which just sorts the dataset by speaker count and splits it by row number, is much faster, but it gets the job done.
The splitting works in 2 steps:
1. Create a continuous index over every client_id in the dataset.
2. Loop over the continuous index, adding each client_id first to the test dataset, then to the dev dataset, and the rest to the train dataset, until each dataset is filled up to its previously calculated size.
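The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `split_by_client`, the list-of-dicts input shape, and the ascending ordering by contribution count are all assumptions, and the target sizes are assumed to be computed elsewhere.

```python
from collections import Counter


def split_by_client(rows, test_size, dev_size):
    """Split rows into train/dev/test so that no client_id is shared.

    rows: list of dicts with a "client_id" key (hypothetical input shape).
    test_size, dev_size: target row counts, assumed to be computed elsewhere.
    """
    counts = Counter(row["client_id"] for row in rows)

    # Step 1: continuous index over client_ids (position in this list).
    # Ordering by ascending contribution count is an assumption, chosen so
    # that the heaviest contributors end up in train.
    ordered = sorted(counts, key=lambda cid: (counts[cid], cid))

    # Step 2: walk the index, filling test first, then dev; once both have
    # reached their target row counts, everything else goes to train.
    targets = {"test": test_size, "dev": dev_size}
    filled = {"test": 0, "dev": 0}
    bucket_of = {}
    for cid in ordered:
        for name in ("test", "dev"):
            if filled[name] < targets[name]:
                bucket_of[cid] = name
                filled[name] += counts[cid]
                break
        else:
            bucket_of[cid] = "train"

    # Route every row to the bucket its client_id was assigned to.
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[bucket_of[row["client_id"]]].append(row)
    return splits
```

Because whole clients are assigned at once, a split can overshoot its target by up to one client's rows, which is why the resulting sizes are only roughly equal to the calculated ones.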
This way, each dataset contains roughly the previously calculated number of rows needed for the targeted 99% confidence level, with no intersections. Furthermore, the train dataset still contains all the "power users" (i.e. all users with a lot of contributions to the dataset), with a declining contribution factor for the dev and test datasets (in that order).
I analyzed how many intersections there are between datasets with the current method versus the new one presented in this PR: the old one had at most 3 intersections for any given language, and most of the time the only intersection was between dev and test. You can have a look at this Jupyter notebook for reference: https://github.com/simnotes/transcripts/blob/master/notebooks/xor_train_dev_test.ipynb
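For readers who want to reproduce this kind of check without opening the notebook, here is a small sketch of how client_id intersections between splits could be counted. It is not taken from the notebook; the helper name `client_overlap` and the dict-of-row-lists input are assumptions made for illustration.

```python
from itertools import combinations


def client_overlap(splits):
    """Return the client_ids shared by each pair of splits.

    splits: dict mapping a split name to a list of dicts with a
    "client_id" key (hypothetical input shape).
    """
    ids = {
        name: {row["client_id"] for row in rows}
        for name, rows in splits.items()
    }
    # Compare every unordered pair of splits once.
    return {(a, b): ids[a] & ids[b] for a, b in combinations(sorted(ids), 2)}
```

A split is free of intersections when every set in the returned dict is empty.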
So maybe XOR'ing is not needed at all and we can just keep the old one.