Fast loading for cross lingual tasks #572

loicmagne · 2024-04-25T14:37:02Z

This PR implements the changes discussed in #530.

I only implemented fast loading for Tatoeba, for testing purposes; I'll do another PR to convert the other datasets.

I created a separate dataset on HF for testing (I will swap to the mteb one later). Note that this is fully backward compatible: people with old versions of the MTEB package will still be able to load fast datasets, since the change just adds a 'lang' column.

I checked that the results of the slow and fast versions are the same. Data loading from the network goes from ~10 min to ~30 s on Tatoeba, which "only" has ~100 subsets.
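A minimal sketch of the idea, assuming the fast format stores every language pair in one table with a 'lang' column that is fetched once and split locally. This is an illustration only, not the code from this PR: the function name `split_by_lang`, the `sentence1`/`sentence2` columns, and the sample rows are made up, and plain Python lists stand in for the `datasets` library.

```python
from collections import defaultdict


def split_by_lang(rows, lang_column="lang"):
    """Group rows of a single combined table into one subset per language pair."""
    subsets = defaultdict(list)
    for row in rows:
        # Copy the row without the language column, so each subset looks the
        # same as if it had been stored as its own subset.
        subsets[row[lang_column]].append(
            {key: value for key, value in row.items() if key != lang_column}
        )
    return dict(subsets)


# Hypothetical sample rows standing in for the combined dataset.
combined = [
    {"sentence1": "Bonjour", "sentence2": "Hello", "lang": "fra-eng"},
    {"sentence1": "Merci", "sentence2": "Thank you", "lang": "fra-eng"},
    {"sentence1": "Hallo", "sentence2": "Hello", "lang": "deu-eng"},
]
subsets = split_by_lang(combined)
```

The saving would come from replacing roughly one request per subset with a single download followed by a local pass over the rows.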

mteb/tasks/BitextMining/multilingual/TatoebaBitextMining.py

mteb/abstasks/CrosslingualTask.py

loicmagne · 2024-04-25T15:09:45Z

This test is failing because of the mocking; I don't exactly understand why.

mteb/tests/test_all_abstasks.py

Lines 19 to 33 in 28e5522

    
    @patch("datasets.load_dataset")
    def test_load_data(mock_load_dataset: Mock, task: AbsTask):
        # TODO: We skip because this load_data is completely different.
        if isinstance(task, AbsTaskRetrieval) or isinstance(
            task, AbsTaskInstructionRetrieval
        ):
            pytest.skip()
        with patch.object(task, "dataset_transform") as mock_dataset_transform:
            task.load_data()
            mock_load_dataset.assert_called()
            # They don't yet but should they so they can be expanded more easily?
            if not task.is_crosslingual and not task.is_multilingual:
                mock_dataset_transform.assert_called_once()
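
The thread does not establish why the mock fails here. As general background rather than a diagnosis, one common reason an assertion like `mock_load_dataset.assert_called()` fails is that `patch` only replaces the attribute on the module it names. Code that captured a reference to the function before the patch was applied never sees the mock. The sketch below shows this with the standard library's `json.dumps` standing in for `datasets.load_dataset`; the helper function names are made up for the illustration.

```python
import json
from json import dumps as early_bound_dumps  # reference captured once, at import time
from unittest.mock import patch


def via_module_attribute(obj):
    # Looks up `dumps` on the `json` module at call time, so the patch is seen.
    return json.dumps(obj)


def via_early_binding(obj):
    # Uses the reference captured at import time, so the patch is not seen.
    return early_bound_dumps(obj)


with patch("json.dumps") as mock_dumps:
    via_module_attribute({"a": 1})
    calls_after_attribute_lookup = mock_dumps.call_count
    real_output = via_early_binding({"a": 1})
    calls_after_early_binding = mock_dumps.call_count
```

Only the first call is recorded by the mock; the second runs the real function. If a loader reaches the loading function through some path other than the patched attribute, the test's `assert_called()` would fail in the same way.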

mteb/abstasks/CrosslingualTask.py

KennethEnevoldsen · 2024-04-26T22:08:05Z

This looks very good @loicmagne:

One main consideration:

- I think we should just use this as the new default (no reason to support two methods)
  - We can make a script to transfer the datasets and re-upload them
  - There are a few STS tasks that use cross-lingual (we might consider those as well)

loicmagne · 2024-04-28T15:00:29Z

I've added documentation for fast loading, and I've made it a mixin instead of a specific implementation of CrosslingualTask. This way we can use it directly with MultilingualTask as well, since some of those tasks might also benefit from it.
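
A rough sketch of what a loader mixin with a toggle could look like. Every name here (`SubsetLoaderMixin`, `fast_loading`, `fetch_subset`, `fetch_all`, `InMemoryTask`) is hypothetical and chosen for illustration; the actual classes and method names in this PR may differ, and the data is served from memory instead of the network.

```python
class SubsetLoaderMixin:
    """Hypothetical mixin: toggles between per-subset and single-pass loading."""

    fast_loading = False  # hypothetical toggle, off by default

    def load_data(self):
        self.dataset = self.fast_load() if self.fast_loading else self.slow_load()

    def slow_load(self):
        # One fetch per language subset (many small requests).
        return {lang: self.fetch_subset(lang) for lang in self.langs}

    def fast_load(self):
        # One fetch for everything, then split on the 'lang' column.
        subsets = {lang: [] for lang in self.langs}
        for row in self.fetch_all():
            if row["lang"] in subsets:
                subsets[row["lang"]].append(row["text"])
        return subsets


class InMemoryTask:
    """Stand-in for a task base class; serves rows from memory."""

    langs = ["fra-eng", "deu-eng"]
    rows = [
        {"lang": "fra-eng", "text": "Bonjour"},
        {"lang": "deu-eng", "text": "Hallo"},
        {"lang": "fra-eng", "text": "Merci"},
    ]

    def fetch_subset(self, lang):
        return [row["text"] for row in self.rows if row["lang"] == lang]

    def fetch_all(self):
        return list(self.rows)


class SlowTask(SubsetLoaderMixin, InMemoryTask):
    pass


class FastTask(SubsetLoaderMixin, InMemoryTask):
    fast_loading = True


slow, fast = SlowTask(), FastTask()
slow.load_data()
fast.load_data()
```

Because the mixin only relies on `langs` and the two fetch methods, it can be combined with any task base class that provides them, and both paths can be checked to produce the same `dataset`.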

> This looks very good @loicmagne:
>
> One main consideration:
>
> * I think we should just use this as the new default (no reason to support two methods)
>   * We can make a script to transfer the datasets and re-upload them
>   * There are a few STS tasks that use cross-lingual (we might consider those as well)

Yes, I think it would be fair to use this as the default. I'm not really sure how to automate the conversion with a script though, since it also requires changing the config files. So far I did it manually, depending on the format of each dataset, but maybe there's a simpler way.

For now I would leave "fast loading" as a toggleable option until everything is converted.

KennethEnevoldsen

@loicmagne I think this looks very solid. Will you add points (I believe this is worth a solid 10)? Then I will merge it in.

Will you also create an issue on converting the current datasets to the new format?

loicmagne · 2024-04-29T12:47:02Z

I moved the updated Tatoeba dataset to the MTEB org on HF. I think we can merge now @KennethEnevoldsen, if everything's OK.

I'll do the PR to update the other datasets.

loicmagne · 2024-04-30T13:12:17Z

@KennethEnevoldsen can we merge this?

KennethEnevoldsen · 2024-04-30T19:54:45Z

@loicmagne I think we definitely can. I will set it to auto-merge.

fast loading for bitext mining

735e1e4

loicmagne requested a review from KennethEnevoldsen April 25, 2024 14:37

loicmagne self-assigned this Apr 25, 2024

lint

e722dce

loicmagne commented Apr 25, 2024

mteb/tasks/BitextMining/multilingual/TatoebaBitextMining.py (outdated, resolved)

loicmagne commented Apr 25, 2024

mteb/abstasks/CrosslingualTask.py (outdated, resolved)

loicmagne added 3 commits April 25, 2024 16:58

consistency

f65cdcb

bump datasets version

862b128

add polars dependency

54c01db

imenelydiaker reviewed Apr 26, 2024

mteb/abstasks/CrosslingualTask.py (outdated, resolved)

davidstap mentioned this pull request Apr 28, 2024

Add BibleNLP dataset #583

Merged

10 tasks

loicmagne added 3 commits April 28, 2024 16:29

loader mixin

ca1210c

documentation for fast loading

1c0e246

detail

c2b208c

loicmagne and others added 2 commits April 28, 2024 17:27

fix tests

d08cdf6

Merge branch 'main' into fast-subsets

f3c478c

KennethEnevoldsen approved these changes Apr 29, 2024

loicmagne added 2 commits April 29, 2024 14:38

tatoeba dataset in mteb org

2012f67

added points

6033f01

loicmagne added 2 commits April 29, 2024 14:51

Merge branch 'embeddings-benchmark:main' into fast-subsets

ed0fe57

Merge branch 'main' into fast-subsets

47f064e

Merge branch 'main' into fast-subsets

0e35303

KennethEnevoldsen enabled auto-merge (squash) April 30, 2024 19:54

KennethEnevoldsen mentioned this pull request Apr 30, 2024

add xsimplusplus in retrieval category #601

Closed

10 tasks

KennethEnevoldsen merged commit aa2ffe8 into embeddings-benchmark:main Apr 30, 2024
7 checks passed

loicmagne mentioned this pull request May 5, 2024

fix: Convert Multilingual/Crosslingual to fast-loading format #635

Merged

19 tasks

KennethEnevoldsen mentioned this pull request May 9, 2024

Sampling MTEB #647

Closed
