[feat] Improve GroupByLabelBatchSampler #2788

fpgmaas · 2024-06-27T14:26:39Z

This PR does two things:

Fix the `GroupByLabelBatchSampler`

I believe this part of the code is incorrect:

    for column_name in valid_label_columns or []:
        if column_name in dataset.column_names:
            labels = dataset["label"]
            break
    else:
        raise ValueError(f"None of the valid_label_columns {valid_label_columns} are in the dataset.")

It always reads the "label" column, whereas I expect the purpose of this code is to find the first entry of valid_label_columns that is a valid column in the dataset and read that one. This PR fixes that and makes some other small adjustments, for example removing the lines of code mentioned in #2782.
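For illustration, the intended lookup could be sketched roughly as below. This is a minimal standalone sketch, not the code merged in this PR: the helper name `find_label_column` is hypothetical, and a plain dict stands in for the dataset (its keys play the role of `dataset.column_names`).

```python
from __future__ import annotations


def find_label_column(column_names: list[str], valid_label_columns: list[str] | None) -> str:
    """Return the first entry of valid_label_columns that exists in column_names.

    Hypothetical helper for illustration only; not part of sentence_transformers.
    """
    for column_name in valid_label_columns or []:
        if column_name in column_names:
            # Return the matching name instead of hard-coding "label".
            return column_name
    raise ValueError(f"None of the valid_label_columns {valid_label_columns} are in the dataset.")


# Stand-in for a dataset: a plain dict mapping column name -> values (an assumption).
dataset = {"text": ["a", "b", "c"], "labels": [0, 1, 0]}

label_column = find_label_column(list(dataset), ["label", "labels"])
labels = dataset[label_column]
```

With the original snippet, the same input would have attempted to read `dataset["label"]`, which does not exist here, even though `"labels"` is a valid match.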

Add unit tests

Current test coverage for sampler.py is 31%; this bumps that up a little bit :) In addition, I think this unit test makes it clearer to future developers what the class is aiming to do.

tomaarsen

Thanks for this! I do think I'll need changes regarding the self.groups as discussed in that comment itself, but otherwise this is heading in the right direction!

sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

sentence_transformers/sampler.py

tomaarsen

Small nitpick regarding the error

sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

tomaarsen

Thanks a bunch for taking care of my review comments. After a nitpicky commit, I think this is all good to go (at least, once the tests go green)!

Tom Aarsen

Allow inheriting the Transformer class (UKPLab#2810)

[`feat`] Add hard negatives mining utility (UKPLab#2768)
* Add hard negatives mining utility
* Add example datasets/models for hard negative mining tip
* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (UKPLab#2795)
* add test for NoDuplicatesBatchSampler
* formatting
* simplify tests

[chore] Add test for RoundrobinBatchSampler (UKPLab#2798)
* Add test for RoundrobinBatchSampler
* fix test
* improve RoundRobinBatchSampler and add additional test
* Make datasets in ConcatDataset different sizes
  As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.
---------
Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (UKPLab#2788)
* Improve GroupByLabelBatchSampler
* small fix
* improve test
* Update sentence_transformers/sampler.py
  Co-authored-by: Tom Aarsen <[email protected]>
* fix sampler and add unit test
* fix comment
* remove .DS_Store
* rm DS_Store
* change self.groups statement
* move to damplers dir
* Update sentence_transformers/sampler.py
  Co-authored-by: Tom Aarsen <[email protected]>
* Add typing
---------
Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (UKPLab#2799)
add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor

#2794)
* Update outdated docs links

Allow inheriting the Transformer class (#2810)

[`feat`] Add hard negatives mining utility (#2768)
* Add hard negatives mining utility
* Add example datasets/models for hard negative mining tip
* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (#2795)
* add test for NoDuplicatesBatchSampler
* formatting
* simplify tests

[chore] Add test for RoundrobinBatchSampler (#2798)
* Add test for RoundrobinBatchSampler
* fix test
* improve RoundRobinBatchSampler and add additional test
* Make datasets in ConcatDataset different sizes
  As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.
---------
Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (#2788)
* Improve GroupByLabelBatchSampler
* small fix
* improve test
* Update sentence_transformers/sampler.py
  Co-authored-by: Tom Aarsen <[email protected]>
* fix sampler and add unit test
* fix comment
* remove .DS_Store
* rm DS_Store
* change self.groups statement
* move to damplers dir
* Update sentence_transformers/sampler.py
  Co-authored-by: Tom Aarsen <[email protected]>
* Add typing
---------
Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (#2799)
add test coverage command add to workflow fix cicd fix cicd fix leave cicd untouched fix gitignore fix gitignore update gitignore update gitignore fix gitignore fix gitignor
* add command to open cov
* fix setup.py
* remove open command
---------
Co-authored-by: Tom Aarsen <[email protected]>

fpgmaas added 3 commits June 27, 2024 16:26

Improve GroupByLabelBatchSampler

d32e5e3

small fix

3392009

improve test

811767c

fpgmaas mentioned this pull request Jun 27, 2024

GroupByLabelBatchSampler #2782

Open

waileong-leong approved these changes Jun 27, 2024


fpgmaas changed the title ~~Improve GroupByLabelBatchSampler~~ [feat] Improve GroupByLabelBatchSampler Jun 27, 2024

tomaarsen requested changes Jun 28, 2024

sentence_transformers/sampler.py (outdated, resolved)

sentence_transformers/sampler.py (outdated, resolved)

fpgmaas and others added 4 commits June 28, 2024 12:46

Update sentence_transformers/sampler.py

74f10d7

Co-authored-by: Tom Aarsen <[email protected]>

Merge branch 'UKPLab:master' into feature/improve-label-batch-sampler

f6f23ee

fix sampler and add unit test

7c71a7d

fix comment

a392305

fpgmaas requested a review from tomaarsen June 28, 2024 11:44

fpgmaas added 2 commits June 28, 2024 13:50

remove .DS_Store

d8aa896

rm DS_Store

6dca1e0

tomaarsen reviewed Jun 28, 2024

sentence_transformers/sampler.py (outdated, resolved)

fpgmaas added 2 commits June 28, 2024 14:42

change self.groups statement

b33586b

move to damplers dir

9a424ae

fpgmaas requested a review from tomaarsen June 28, 2024 20:29

tomaarsen requested changes Jul 5, 2024

sentence_transformers/sampler.py (outdated, resolved)

fpgmaas and others added 3 commits July 5, 2024 13:53

Update sentence_transformers/sampler.py

088d73c

Co-authored-by: Tom Aarsen <[email protected]>

Merge branch 'UKPLab:master' into feature/improve-label-batch-sampler

0defb79

Add typing

6b485b5

tomaarsen approved these changes Jul 9, 2024


tomaarsen merged commit c05b105 into UKPLab:master Jul 9, 2024
11 checks passed
