Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Improve GroupByLabelBatchSampler #2788

Merged
merged 14 commits into from
Jul 9, 2024

Conversation

fpgmaas
Copy link
Contributor

@fpgmaas fpgmaas commented Jun 27, 2024

This PR does two things:

Fix the GroupByLabelBatchSampler

I believe this part of code to be incorrect;

    for column_name in valid_label_columns or []:
        if column_name in dataset.column_names:
            labels = dataset["label"]
            break
    else:
        raise ValueError(f"None of the valid_label_columns {valid_label_columns} are in the dataset.")

It seems to try and always use the "label" column, where I expect that the purpose of this code is to find the first value of valid_label_columns that is a valid column in the dataset. This PR fixes that, and make some other small adjustments, for example removing the lines of code mentioned here: #2782

Add unit tests

Current test coverage for sampler.py is 31%, this bumps that up a little bit :) In addition, I think having this unit test can also make it more clear for future developers to understand what the class is aiming to do.

@fpgmaas fpgmaas changed the title Improve GroupByLabelBatchSampler [feat] Improve GroupByLabelBatchSampler Jun 27, 2024
Copy link
Collaborator

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I do think I'll need changes regarding the self.groups as discussed in that comment itself, but otherwise this is heading in the right direction!

sentence_transformers/sampler.py Outdated Show resolved Hide resolved
sentence_transformers/sampler.py Outdated Show resolved Hide resolved
@fpgmaas fpgmaas requested a review from tomaarsen June 28, 2024 11:44
@fpgmaas fpgmaas requested a review from tomaarsen June 28, 2024 20:29
Copy link
Collaborator

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick regarding the error

sentence_transformers/sampler.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch for taking care of my review comments. After some nitpicky commit, I think this is all good to go (at least, once the tests go green)!

  • Tom Aarsen

@tomaarsen tomaarsen merged commit c05b105 into UKPLab:master Jul 9, 2024
11 checks passed
fpgmaas pushed a commit to fpgmaas/sentence-transformers that referenced this pull request Jul 9, 2024
Allow inheriting the Transformer class (UKPLab#2810)

[`feat`] Add hard negatives mining utility (UKPLab#2768)

* Add hard negatives mining utility

* Add example datasets/models for hard negative mining tip

* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (UKPLab#2795)

* add test for NoDuplicatesBatchSampler

* formatting

* simplify tests

[chore] Add test for RoundrobinBatchSampler (UKPLab#2798)

* Add test for RoundrobinBatchSampler

* fix test

* improve RoundRobinBatchSampler and add additional test

* Make datasets in ConcatDataset different sizes

As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.

---------

Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (UKPLab#2788)

* Improve GroupByLabelBatchSampler

* small fix

* improve test

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* fix sampler and add unit test

* fix comment

* remove .DS_Store

* rm DS_Store

* change self.groups statement

* move to damplers dir

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* Add typing

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (UKPLab#2799)

add test coverage command

add to workflow

fix cicd

fix cicd

fix

leave cicd untouched

fix gitignore

fix gitignore

update gitignore

update gitignore

fix gitignore

fix gitignor
fpgmaas pushed a commit to fpgmaas/sentence-transformers that referenced this pull request Jul 9, 2024
Allow inheriting the Transformer class (UKPLab#2810)

[`feat`] Add hard negatives mining utility (UKPLab#2768)

* Add hard negatives mining utility

* Add example datasets/models for hard negative mining tip

* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (UKPLab#2795)

* add test for NoDuplicatesBatchSampler

* formatting

* simplify tests

[chore] Add test for RoundrobinBatchSampler (UKPLab#2798)

* Add test for RoundrobinBatchSampler

* fix test

* improve RoundRobinBatchSampler and add additional test

* Make datasets in ConcatDataset different sizes

As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.

---------

Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (UKPLab#2788)

* Improve GroupByLabelBatchSampler

* small fix

* improve test

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* fix sampler and add unit test

* fix comment

* remove .DS_Store

* rm DS_Store

* change self.groups statement

* move to damplers dir

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* Add typing

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (UKPLab#2799)

add test coverage command

add to workflow

fix cicd

fix cicd

fix

leave cicd untouched

fix gitignore

fix gitignore

update gitignore

update gitignore

fix gitignore

fix gitignor
fpgmaas pushed a commit to fpgmaas/sentence-transformers that referenced this pull request Jul 9, 2024
Allow inheriting the Transformer class (UKPLab#2810)

[`feat`] Add hard negatives mining utility (UKPLab#2768)

* Add hard negatives mining utility

* Add example datasets/models for hard negative mining tip

* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (UKPLab#2795)

* add test for NoDuplicatesBatchSampler

* formatting

* simplify tests

[chore] Add test for RoundrobinBatchSampler (UKPLab#2798)

* Add test for RoundrobinBatchSampler

* fix test

* improve RoundRobinBatchSampler and add additional test

* Make datasets in ConcatDataset different sizes

As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.

---------

Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (UKPLab#2788)

* Improve GroupByLabelBatchSampler

* small fix

* improve test

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* fix sampler and add unit test

* fix comment

* remove .DS_Store

* rm DS_Store

* change self.groups statement

* move to damplers dir

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* Add typing

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (UKPLab#2799)

add test coverage command

add to workflow

fix cicd

fix cicd

fix

leave cicd untouched

fix gitignore

fix gitignore

update gitignore

update gitignore

fix gitignore

fix gitignor
tomaarsen added a commit that referenced this pull request Jul 9, 2024
#2794)

* Update outdated docs links

Allow inheriting the Transformer class (#2810)

[`feat`] Add hard negatives mining utility (#2768)

* Add hard negatives mining utility

* Add example datasets/models for hard negative mining tip

* Update phrasing in dataset overview

[chore] add test for NoDuplicatesBatchSampler (#2795)

* add test for NoDuplicatesBatchSampler

* formatting

* simplify tests

[chore] Add test for RoundrobinBatchSampler (#2798)

* Add test for RoundrobinBatchSampler

* fix test

* improve RoundRobinBatchSampler and add additional test

* Make datasets in ConcatDataset different sizes

As the real "use case" of the RoundRobin sampler is to avoid sampling from one dataset more than from another. This is best tested when the datasets have different sizes.

---------

Co-authored-by: Tom Aarsen <[email protected]>

[feat] Improve GroupByLabelBatchSampler (#2788)

* Improve GroupByLabelBatchSampler

* small fix

* improve test

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* fix sampler and add unit test

* fix comment

* remove .DS_Store

* rm DS_Store

* change self.groups statement

* move to damplers dir

* Update sentence_transformers/sampler.py

Co-authored-by: Tom Aarsen <[email protected]>

* Add typing

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>

[`chore`] Clean-up `.gitignore` (#2799)

add test coverage command

add to workflow

fix cicd

fix cicd

fix

leave cicd untouched

fix gitignore

fix gitignore

update gitignore

update gitignore

fix gitignore

fix gitignor

* add command to open cov

* fix setup.py

* remove open command

---------

Co-authored-by: Tom Aarsen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants