Fast loading for cross lingual tasks #572

Merged
merged 15 commits into from
Apr 30, 2024


loicmagne
Member

This PR implements the changes discussed in #530.

I only implemented fast loading for Tatoeba for testing purposes; I'll do another PR to convert the other datasets.

I created a separate dataset on HF for testing purposes (I will move it to the mteb org later). Note that this is fully backward compatible: people with old versions of the MTEB package will still be able to load fast datasets, since the change just adds a 'lang' column.

I checked that the slow and fast versions produce the same results. Loading Tatoeba, which "only" has ~100 subsets, goes from ~10 min over the network to ~30 s.
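
A minimal sketch of the idea, assuming the fast format stores every language pair in a single table with a 'lang' column: the table is fetched once and grouped locally, instead of issuing one request per subset. The row layout and the `split_by_lang` helper are hypothetical, and the sketch uses plain Python rather than the `datasets` API.

```python
from collections import defaultdict


def split_by_lang(rows):
    """Group rows of one merged table into per-language-pair subsets."""
    subsets = defaultdict(list)
    for row in rows:
        # Drop the 'lang' key so each subset resembles the old per-subset format.
        subsets[row["lang"]].append({k: v for k, v in row.items() if k != "lang"})
    return dict(subsets)


# Hypothetical merged table: a single load instead of one request per subset.
merged = [
    {"sentence1": "Bonjour", "sentence2": "Hello", "lang": "fra-eng"},
    {"sentence1": "Hallo", "sentence2": "Hello", "lang": "deu-eng"},
    {"sentence1": "Merci", "sentence2": "Thanks", "lang": "fra-eng"},
]
subsets = split_by_lang(merged)
```

Because old clients simply ignore the extra column, the same table can serve both loading paths.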

@loicmagne loicmagne self-assigned this Apr 25, 2024
@loicmagne
Member Author

loicmagne commented Apr 25, 2024

This test is failing because of the mocking, and I don't understand exactly why:

@patch("datasets.load_dataset")
def test_load_data(mock_load_dataset: Mock, task: AbsTask):
    # TODO: We skip because this load_data is completely different.
    if isinstance(task, AbsTaskRetrieval) or isinstance(
        task, AbsTaskInstructionRetrieval
    ):
        pytest.skip()
    with patch.object(task, "dataset_transform") as mock_dataset_transform:
        task.load_data()
        mock_load_dataset.assert_called()
        # They don't yet but should they so they can be expanded more easily?
        if not task.is_crosslingual and not task.is_multilingual:
            mock_dataset_transform.assert_called_once()
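
One common cause of this kind of failure, which may or may not be what happened here, is patching a function in the module that defines it while the code under test holds its own reference obtained via `from module import name`. The sketch below reproduces that behaviour with two throwaway modules; `fake_lib` and `consumer` are hypothetical stand-ins, not part of MTEB or `datasets`.

```python
import sys
import types
from unittest.mock import patch

# Hypothetical library module exposing a loader function.
fake_lib = types.ModuleType("fake_lib")
fake_lib.load = lambda: "real"
sys.modules["fake_lib"] = fake_lib

# Hypothetical consumer module that binds the name at import time.
consumer = types.ModuleType("consumer")
exec(
    "from fake_lib import load\n"
    "def run():\n"
    "    return load()\n",
    consumer.__dict__,
)
sys.modules["consumer"] = consumer

# Patching the defining module does not affect the consumer's own reference.
with patch("fake_lib.load", return_value="mocked") as lib_mock:
    result_lib_patch = consumer.run()
    called_lib = lib_mock.called

# Patching the name where it is looked up does take effect.
with patch("consumer.load", return_value="mocked") as consumer_mock:
    result_consumer_patch = consumer.run()
    called_consumer = consumer_mock.called
```

If a new loading path calls a loader through a name the test does not patch, `assert_called()` on the original mock fails even though data was loaded.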

@KennethEnevoldsen
Contributor

This looks very good @loicmagne:

A main consideration:

  • I think we should just use this as the new default (no reason to support two methods)
    • We can make a script to transfer the datasets and re-upload them
    • There are a few STS tasks that use cross-lingual (we might consider those as well)

@davidstap davidstap mentioned this pull request Apr 28, 2024
10 tasks
@loicmagne
Member Author

loicmagne commented Apr 28, 2024

I've added documentation for fast loading, and I've made it a mixin instead of a specific implementation of CrosslingualTasks. This way we can directly use it with MultilingualTask, as some of these tasks might also benefit from it
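
A minimal sketch of what such a mixin could look like, assuming a class-level toggle and two loading paths. All names here (`FastLoadingMixin`, `InMemorySource`, `fetch_all`, `fetch_subset`) are hypothetical and do not reflect the actual MTEB classes; the stand-in base class counts simulated network requests.

```python
class FastLoadingMixin:
    """Hypothetical mixin: one bulk load when fast_loading is enabled."""

    fast_loading = False

    def load_data(self):
        if self.fast_loading:
            # Single request, then filter locally on the 'lang' column.
            rows = self.fetch_all()
            return {
                lang: [r for r in rows if r["lang"] == lang] for lang in self.langs
            }
        # Fallback: one request per language subset.
        return {lang: self.fetch_subset(lang) for lang in self.langs}


class InMemorySource:
    """Stand-in for a task base class; counts simulated requests."""

    langs = ["fra-eng", "deu-eng"]
    _rows = [
        {"text": "Bonjour", "lang": "fra-eng"},
        {"text": "Hallo", "lang": "deu-eng"},
    ]

    def __init__(self):
        self.requests = 0

    def fetch_all(self):
        self.requests += 1
        return list(self._rows)

    def fetch_subset(self, lang):
        self.requests += 1
        return [r for r in self._rows if r["lang"] == lang]


class FastTask(FastLoadingMixin, InMemorySource):
    fast_loading = True


class SlowTask(FastLoadingMixin, InMemorySource):
    fast_loading = False


fast, slow = FastTask(), SlowTask()
fast_data, slow_data = fast.load_data(), slow.load_data()
```

Keeping the behaviour in a mixin means any base class that exposes the language list and the fetch methods can opt in without duplicating the loading logic.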

> This looks very good @loicmagne:
>
> A main consideration:
>
> * I think we should just use this as the new default (no reason to support two methods)
>   * We can make a script to transfer the datasets and re-upload them
>   * There are a few STS tasks that use cross-lingual (we might consider those as well)

Yes, I think it would be fair to use this as the default. I'm not really sure how to automate the conversion with a script though, since it also requires changing the config files. So far I have done it manually depending on the format of each dataset, but maybe there's a simpler way.

For now I would keep "fast loading" as a toggled option until everything is converted.
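
As a rough illustration of the data side of such a conversion, the sketch below flattens per-language subsets into one table tagged with a 'lang' column. The `merge_subsets` helper and the sample rows are hypothetical, and the sketch does not cover the config file changes mentioned above or the re-upload step.

```python
def merge_subsets(subsets):
    """Flatten {lang: [rows]} into one list of rows tagged with a 'lang' column."""
    merged = []
    for lang, rows in subsets.items():
        for row in rows:
            # Copy the row so the original subset is left untouched.
            merged.append({**row, "lang": lang})
    return merged


# Hypothetical old layout: one subset per language pair.
old_format = {
    "fra-eng": [{"sentence1": "Bonjour", "sentence2": "Hello"}],
    "deu-eng": [{"sentence1": "Hallo", "sentence2": "Hello"}],
}
merged = merge_subsets(old_format)
```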

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

@loicmagne I think this looks very solid. Will you add points (I believe this is worth a solid 10?) and then I will merge it in.

Will you also create an issue on converting the current datasets to the new format?

@loicmagne
Member Author

I moved the updated Tatoeba dataset to the MTEB org on HF, I think we can merge now @KennethEnevoldsen if everything's ok

I'll do the PR to update other datasets

@loicmagne
Member Author

@KennethEnevoldsen can we merge this?

@KennethEnevoldsen
Contributor

@loicmagne I think we def. can. I will set it to auto-merge

@KennethEnevoldsen KennethEnevoldsen merged commit aa2ffe8 into embeddings-benchmark:main Apr 30, 2024
7 checks passed
@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 9, 2024
3 participants