-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Improve GroupByLabelBatchSampler #2788
[feat] Improve GroupByLabelBatchSampler #2788
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I do think I'll need changes regarding the self.groups
as discussed in that comment itself, but otherwise this is heading in the right direction!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick regarding the error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a bunch for taking care of my review comments. After some nitpicky commit, I think this is all good to go (at least, once the tests go green)!
- Tom Aarsen
Allow inheriting the Transformer class (UKPLab#2810) [`feat`] Add hard negatives mining utility (UKPLab#2768) * Add hard negatives mining utility * Add example datasets/models for hard negative mining tip * Update phrasing in dataset overview [chore] add test for NoDuplicatesBatchSampler (UKPLab#2795) * add test for NoDuplicatesBatchSampler * formatting * simplify tests [chore] Add test for RoundrobinBatchSampler (UKPLab#2798) * Add test for RoundrobinBatchSampler * fix test * improve RoundRobinBatchSampler and add additional test * Make datasets in ConcatDataset different sizes As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes. --------- Co-authored-by: Tom Aarsen <[email protected]> [feat] Improve GroupByLabelBatchSampler (UKPLab#2788) * Improve GroupByLabelBatchSampler * small fix * improve test * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * fix sampler and add unit test * fix comment * remove .DS_Store * rm DS_Store * change self.groups statement * move to damplers dir * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * Add typing --------- Co-authored-by: Tom Aarsen <[email protected]> Co-authored-by: Tom Aarsen <[email protected]> [`chore`] Clean-up `.gitignore` (UKPLab#2799) add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor
Allow inheriting the Transformer class (UKPLab#2810) [`feat`] Add hard negatives mining utility (UKPLab#2768) * Add hard negatives mining utility * Add example datasets/models for hard negative mining tip * Update phrasing in dataset overview [chore] add test for NoDuplicatesBatchSampler (UKPLab#2795) * add test for NoDuplicatesBatchSampler * formatting * simplify tests [chore] Add test for RoundrobinBatchSampler (UKPLab#2798) * Add test for RoundrobinBatchSampler * fix test * improve RoundRobinBatchSampler and add additional test * Make datasets in ConcatDataset different sizes As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes. --------- Co-authored-by: Tom Aarsen <[email protected]> [feat] Improve GroupByLabelBatchSampler (UKPLab#2788) * Improve GroupByLabelBatchSampler * small fix * improve test * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * fix sampler and add unit test * fix comment * remove .DS_Store * rm DS_Store * change self.groups statement * move to damplers dir * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * Add typing --------- Co-authored-by: Tom Aarsen <[email protected]> Co-authored-by: Tom Aarsen <[email protected]> [`chore`] Clean-up `.gitignore` (UKPLab#2799) add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor
Allow inheriting the Transformer class (UKPLab#2810) [`feat`] Add hard negatives mining utility (UKPLab#2768) * Add hard negatives mining utility * Add example datasets/models for hard negative mining tip * Update phrasing in dataset overview [chore] add test for NoDuplicatesBatchSampler (UKPLab#2795) * add test for NoDuplicatesBatchSampler * formatting * simplify tests [chore] Add test for RoundrobinBatchSampler (UKPLab#2798) * Add test for RoundrobinBatchSampler * fix test * improve RoundRobinBatchSampler and add additional test * Make datasets in ConcatDataset different sizes As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes. --------- Co-authored-by: Tom Aarsen <[email protected]> [feat] Improve GroupByLabelBatchSampler (UKPLab#2788) * Improve GroupByLabelBatchSampler * small fix * improve test * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * fix sampler and add unit test * fix comment * remove .DS_Store * rm DS_Store * change self.groups statement * move to damplers dir * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * Add typing --------- Co-authored-by: Tom Aarsen <[email protected]> Co-authored-by: Tom Aarsen <[email protected]> [`chore`] Clean-up `.gitignore` (UKPLab#2799) add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor
#2794) * Update outdated docs links Allow inheriting the Transformer class (#2810) [`feat`] Add hard negatives mining utility (#2768) * Add hard negatives mining utility * Add example datasets/models for hard negative mining tip * Update phrasing in dataset overview [chore] add test for NoDuplicatesBatchSampler (#2795) * add test for NoDuplicatesBatchSampler * formatting * simplify tests [chore] Add test for RoundrobinBatchSampler (#2798) * Add test for RoundrobinBatchSampler * fix test * improve RoundRobinBatchSampler and add additional test * Make datasets in ConcatDataset different sizes As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes. --------- Co-authored-by: Tom Aarsen <[email protected]> [feat] Improve GroupByLabelBatchSampler (#2788) * Improve GroupByLabelBatchSampler * small fix * improve test * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * fix sampler and add unit test * fix comment * remove .DS_Store * rm DS_Store * change self.groups statement * move to damplers dir * Update sentence_transformers/sampler.py Co-authored-by: Tom Aarsen <[email protected]> * Add typing --------- Co-authored-by: Tom Aarsen <[email protected]> Co-authored-by: Tom Aarsen <[email protected]> [`chore`] Clean-up `.gitignore` (#2799) add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor * add command to open cov * fix setup.py * remove open command --------- Co-authored-by: Tom Aarsen <[email protected]>
This PR does two things:
Fix the
GroupByLabelBatchSampler
I believe this part of code to be incorrect;
It seems to try and always use the
"label"
column, where I expect that the purpose of this code is to find the first value ofvalid_label_columns
that is a valid column in the dataset. This PR fixes that, and make some other small adjustments, for example removing the lines of code mentioned here: #2782Add unit tests
Current test coverage for
sampler.py
is 31%, this bumps that up a little bit :) In addition, I think having this unit test can also make it more clear for future developers to understand what the class is aiming to do.