add xsimplusplus in retrieval category #601
Conversation
Adding the xsim++ dataset, spanning over 200 languages, as a retrieval-category task.
A question related to the task formulation
"path": "jaygala24/xsimplusplus", | ||
"revision": "07f92f877ea651659f3815884761c49191a1c80c", | ||
}, | ||
description="xsim++ is dataset created to capture more subtle improvements in bitext mining by adding challenging negative examples.", |
Bitext mining? What is the reason we formulate it as a retrieval task then?
The default formulation for bitext mining expects a one-to-one mapping between source and target translations, where every example other than the one at the corresponding index acts as a negative sample. In xsim++, however, each English sentence has a variable number of negative samples depending on the perturbation choices, and the perturbed English sentences have no corresponding positive pair on the target-language side.
The task formulation that we think best fits the xsim++ dataset in MTEB is retrieval, where each query (a target-language sentence) can have a variable number of positive (the ground-truth English sentence) and negative pairs (the perturbed English sentences). This is similar to the following task currently in the retrieval category.
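To make the mapping concrete, here is a rough sketch of how the retrieval formulation could be laid out with query/corpus/qrels dictionaries. The ids and sentences are made up for illustration and are not the dataset's actual schema:

```python
# Illustrative only: made-up ids and sentences, not the real xsim++ schema.
# Each target-language sentence acts as a query; the corpus holds the single
# ground-truth English sentence plus its perturbed variants, and the qrels
# mark only the ground truth as relevant.
queries = {"q0": "Das ist ein Beispielsatz."}
corpus = {
    "d0": {"text": "This is an example sentence."},   # ground-truth English (positive)
    "d1": {"text": "This is an example sentences."},  # perturbed English (negative)
    "d2": {"text": "That was an example sentence."},  # perturbed English (negative)
}
relevant_docs = {"q0": {"d0": 1}}  # each query can have any number of negatives
```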
Thanks @jaygala24, my suspicion was that it was due to the negative examples as well. Do note that it is possible to introduce new task types to MTEB; however, let us keep it as a retrieval task for now.
Ah, now I have followed this conversation through. xsim++ was developed to evaluate bitext mining systems cross-lingually.
Implementation-wise, reusing the retrieval evaluation code makes sense. However, the retrieval metrics are not really relevant to the task (the paper never evaluates them; it only reports pairwise model accuracy). I would suggest finding a case where this dataset was used beyond pairwise comparison and reporting those same metrics.
Conceptually, it is an adversarial evaluation, similar to what the Cross-Lingual Semantic Discrimination task (PR coming) will be. When I am through with that addition, it might make sense to have xsim++ as a tab within that section.
Moving this also here for visibility:
Isn't this by definition a cross-lingual task (EN-other languages), and therefore shouldn't it extend CrosslingualTask instead of MultilingualTask? Could this fix the issue without being able to use fast-load?
One more thing: with the retrieval code, you are encoding the "corpus" (which is text of similar size to the original sentences) using model.encode_corpus. For models where this matters (e5-multilingual, for example), this would give the wrong prompt to the model, altering the results.
Consider adding a binary variable to the search function in RetrievalEvaluator that allows the corpus to also be encoded with encode_queries. It should be an easy fix and will make a difference in the results.
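A minimal sketch of what such a flag could look like, assuming a model that exposes encode_queries/encode_corpus; the argument name and the signature below are illustrative, not the actual MTEB API:

```python
import numpy as np

def search(model, corpus: dict, queries: dict, top_k: int,
           encode_corpus_as_queries: bool = False) -> dict:
    """Return {query_id: {doc_id: score}} for the top_k most similar documents."""
    qids, cids = list(queries), list(corpus)
    q_emb = np.asarray(model.encode_queries([queries[q] for q in qids]))
    texts = [corpus[c]["text"] for c in cids]
    if encode_corpus_as_queries:
        # xsim++ "documents" are ordinary sentences, so give them the query prompt.
        c_emb = np.asarray(model.encode_queries(texts))
    else:
        c_emb = np.asarray(model.encode_corpus(texts))
    # Cosine similarity between every query and every document.
    q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c_emb = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    scores = q_emb @ c_emb.T
    results = {}
    for i, qid in enumerate(qids):
        top = np.argsort(-scores[i])[:top_k]
        results[qid] = {cids[j]: float(scores[i, j]) for j in top}
    return results
```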
year = {2023},
doi = {10.48550/arXiv.2306.12907},
}""",
n_samples={"dev": 997, "devtest": 1012},
Are there really only 1012 samples in total?
Any reason to use both the dev and devtest splits?
The dataset has 1012 samples per language, and there are 200 languages, so in total it amounts to about 200k samples.
The dataset has both dev and devtest splits, but we can choose to include only devtest as the eval split, since people usually report results on it.
We should remove the dev set (200k is already on the higher end). Please add the fast loading introduced in #572 to avoid excessively long loading times.
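For reference, one way the fast-loading idea could look, as a sketch under assumptions (a single Hugging Face config holding all languages and a "lang" column; the actual implementation in #572 may differ):

```python
from datasets import load_dataset

def fast_load(path: str, revision: str, langs: list[str], split: str = "devtest") -> dict:
    """Load the whole split once and slice it per language locally,
    instead of issuing one load_dataset() call per language config."""
    full = load_dataset(path, revision=revision, split=split)
    return {
        lang: full.filter(lambda ex, l=lang: ex["lang"] == l)
        for lang in langs
    }
```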
Then feel free to add the points as well as update the scores using the fast implementation (just to show that the implementation works as intended).
Checklist for adding MMTEB dataset
Reason for dataset addition:
- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- If the dataset is too large, consider using self.stratified_subsampling() under dataset_transform().
- Run tests locally to make sure nothing is broken using make test.
- Run the formatter to format the code using make lint.
- I have added points for my submission to the points folder, using the PR number as the filename (e.g. 438.jsonl).