Fast loading for cross lingual tasks #572
This test is failing because of the mocking, and I don't understand exactly why: mteb/tests/test_all_abstasks.py, lines 19 to 33 in 28e5522
This looks very good, @loicmagne. A few main considerations:
I've added documentation for fast loading, and I've made it a mixin instead of a specific implementation of CrosslingualTasks. This way we can use it directly with MultilingualTask, since some of those tasks might also benefit from it.
Yes, I think it would be fair to use this as the default. I'm not really sure how to automate the conversion with a script though, since it also requires changing the config files. So far I did it manually, depending on the format of each dataset, but maybe there's a simpler way. For now I would leave "fast loading" as a toggled option until everything is converted.
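To make the mixin-plus-toggle idea above concrete, here is a minimal sketch. All names in it (`FastLoadingMixin`, `fast_loading`, `_fast_load`, `_slow_load`, and the two task classes) are hypothetical and chosen for illustration; this is not the code from the PR.

```python
class FastLoadingMixin:
    """Hypothetical mixin: adds an opt-in fast loading path to any task class."""

    # Off by default, so tasks whose datasets are not converted yet keep working.
    fast_loading = False

    def load_data(self):
        # Pick the loading strategy based on the per-task toggle.
        loader = self._fast_load if self.fast_loading else self._slow_load
        self.dataset = loader()
        return self.dataset

    def _fast_load(self):
        # Placeholder for "load one merged table, then split it by language".
        return {lang: f"{lang} (from merged table)" for lang in self.langs}

    def _slow_load(self):
        # Placeholder for "load each language subset separately".
        return {lang: f"{lang} (own subset)" for lang in self.langs}


class CrosslingualSketch(FastLoadingMixin):
    langs = ["eng-fra"]
    fast_loading = True  # dataset already converted, so opt in


class MultilingualSketch(FastLoadingMixin):
    langs = ["en", "de"]  # not converted yet, keeps the default slow path
```

Because the toggle lives on the mixin, either kind of task can opt in one dataset at a time without touching the other.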
@loicmagne I think this looks very solid. Will you add points (I believe this is worth a solid 10?), and then I will merge it in.
Will you also create an issue on converting the current datasets to the new format?
I moved the updated Tatoeba dataset to the MTEB org on HF, so I think we can merge now. @KennethEnevoldsen if everything's ok, I'll do the PR to update the other datasets.
@KennethEnevoldsen can we merge this?
@loicmagne I think we def. can. I will set it to auto-merge.
This PR implements the changes discussed in #530.
I only implemented fast loading for Tatoeba, for testing purposes; I'll do another PR to convert the other datasets.
I created a separate dataset on HF for testing purposes (I will swap to mteb later). Note that this is fully backward compatible: people with old versions of the MTEB package will still be able to load fast datasets, as the new format just adds a 'lang' column.
I checked that the results of the slow and fast versions are the same. Data loading from the network goes from ~10 min to ~30 s on Tatoeba, which "only" has ~100 subsets.
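As a rough illustration of the idea (one table with a 'lang' column, split locally, instead of ~100 separate subset loads), here is a minimal sketch in plain Python. The example rows, the `sentence1`/`sentence2` column names, and the `split_by_lang` helper are assumptions made for this sketch, not the actual MTEB code or the Tatoeba schema.

```python
from collections import defaultdict

# Hypothetical rows of a merged dataset: every subset lives in one table,
# and the added 'lang' column records which subset each row belongs to.
rows = [
    {"sentence1": "Hello", "sentence2": "Bonjour", "lang": "eng-fra"},
    {"sentence1": "Thanks", "sentence2": "Merci", "lang": "eng-fra"},
    {"sentence1": "Hello", "sentence2": "Hallo", "lang": "eng-deu"},
]


def split_by_lang(rows, langs):
    """Group rows by their 'lang' value in one pass, keeping only requested subsets."""
    wanted = set(langs)
    subsets = defaultdict(list)
    for row in rows:
        if row["lang"] in wanted:
            # Drop 'lang' so each subset has the same columns as the old format.
            subsets[row["lang"]].append(
                {key: value for key, value in row.items() if key != "lang"}
            )
    return dict(subsets)


subsets = split_by_lang(rows, ["eng-fra", "eng-deu"])
```

The split is a single local pass over data that was fetched once, which is where the saving over one network request per subset would come from.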