-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure videos are allocated into all specified splits #169
Conversation
Codecov Report
@@ Coverage Diff @@
## master #169 +/- ##
======================================
Coverage 85.0% 85.0%
======================================
Files 30 30
Lines 1843 1851 +8
======================================
+ Hits 1567 1575 +8
Misses 276 276
|
@pjbull ready for another look. this now has the added benefit of doing split allocation within species which is something we had desired but hadn't yet implemented. this should help ensure better training results (in addition to fixing the bug) |
@pjbull ready for reals this time! |
list(values["split_proportions"].keys()), | ||
weights=list(values["split_proportions"].values()), | ||
k=len(species_df) - len(expected_splits), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has one weird edge case that is likely rare, so just worth filing an issue for:
v0.mp4, antelope
v1.mp4, antelope # v1 assigned test by antelope grouping
v1.mp4, cow # subsequently v1 assigned train by cow grouping
v2.mp4, antelope
v4.mp4, cow
v5.mp4, cow
# test set is now missing antelope
This PR fixes the issue that caused the following error:
The issue has to do with the
split
column. When we don't provide it in the labels, we use a random split. But we don't guarantee that we have all of the values from the split_proportions dict (e.g.train
andval
) in the split column. With a small number of videos and the default proportions, all videos getting assigned to train (see insplits.csv
). Since there are no val videos, there is noval_f1_score
.I considered implementing this without the random draw and instead calculating the number of samples that need to go into each split based on the proportions provided. However, this gets tricky to ensure based on rounding. I instead went with the while loop that tries different seeds until it finds a draw that works. If someone specifies split proportions of
train: 100, val: 1
, and only provides say 5 videos, there is risk of an infinite loop. I'm not sure how important it is to catch such a situation since 1) we don't expect people to be training on so few videos and 2) it feels hard to accidentally end up in such a situation.Bonus fix: fix bug in ModelCheckpoint which lets monitor be None if there is no early stopping.
mode
must either bemin
ormax
. I set it tomin
to match PTL's default but it doesn't get used if monitor is None.Plus cleanup to set defaults in
DummyTrainConfig
to avoid repeated code.