add xsimplusplus in retrieval category #601

Closed

Conversation

jaygala24 (Contributor)

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (see the sketch after this list).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
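
For reference, here is a minimal sketch of how the two checklist models could be run on the new task from Python; the task name `XSimPlusPlusRetrieval` is a placeholder for whatever name the task class actually registers under:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder task name; replace with the name the xsim++ task class registers under.
TASK_NAME = "XSimPlusPlusRetrieval"

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=[TASK_NAME])
    evaluation.run(model, output_folder=f"results/{model_name}")
```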

jaygala24 (Contributor Author) commented Apr 29, 2024

Adding the xsim++ dataset, which spans over 200 languages, as a task in the retrieval category.

@KennethEnevoldsen (Contributor) left a comment

A question related to the task formulation

"path": "jaygala24/xsimplusplus",
"revision": "07f92f877ea651659f3815884761c49191a1c80c",
},
description="xsim++ is dataset created to capture more subtle improvements in bitext mining by adding challenging negative examples.",
@KennethEnevoldsen (Contributor) Apr 29, 2024

Bitext mining? What is the reason we formulate it as a retrieval task then?

jaygala24 (Contributor Author)

The default formulation for bitext mining expects a one-to-one mapping between source and target translations, where all examples except the one at the corresponding index act as negative samples. In the case of xsim++, however, each English sentence has a variable number of negative samples depending on the perturbation choices, and there is no corresponding positive pair for the perturbations applied to the target-language sentences.

The task formulation in MTEB that we think fits the xsim++ dataset best is retrieval, where each query (a target-language sentence) can have a variable number of positive pairs (the ground-truth English sentence) and negative pairs (the perturbed English sentences). This is similar to the following task currently in the retrieval category.
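
To make the formulation concrete, here is a minimal, made-up illustration of how one query could be laid out in the usual MTEB retrieval structure (queries / corpus / relevant_docs); the identifiers and sentences below are purely illustrative, not taken from the dataset:

```python
# One target-language query with its gold English translation and a
# variable number of perturbed English negatives (illustrative data only).
queries = {
    "deu_Latn-q1": "Der Hund jagt die Katze im Garten.",
}

corpus = {
    "d1": {"title": "", "text": "The dog chases the cat in the garden."},   # ground-truth translation
    "d2": {"title": "", "text": "The dog chases the cat in the kitchen."},  # perturbed negative
    "d3": {"title": "", "text": "The cat chases the dog in the garden."},   # perturbed negative
    "d4": {"title": "", "text": "The dog watched the cat in the garden."},  # perturbed negative
}

# Each query has exactly one relevant document; the number of negatives varies per query.
relevant_docs = {
    "deu_Latn-q1": {"d1": 1},
}
```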

Contributor

Thanks @jaygala24, my suspicion was that it was due to the negative examples as well. Do note that it is possible to introduce new task types to MTEB; however, let us keep it as a retrieval task for now.

@Andrian0s (Contributor) May 1, 2024

@jaygala24 @KennethEnevoldsen

Ah, now I have followed through this conversation. xsim++ was developed to evaluate bitext mining systems cross-lingually.

Implementation-wise, reusing the retrieval evaluation code makes sense. However, the retrieval metrics are not really relevant to the task (they are not evaluated anywhere in the paper); the authors only report pairwise model accuracy. I would suggest finding a case where this dataset was used beyond pairwise comparison and reporting the same metrics.

Conceptually, it is an adversarial evaluation, similar to what the Cross-Lingual Semantic Discrimination task (PR coming) will be. When I am through with that addition, it might make sense to have xsim++ as a tab within that section.

Moving this here as well for visibility:
Isn't this by definition a cross-lingual task (EN to other languages), and therefore shouldn't it extend CrosslingualTask instead of MultilingualTask? Could this fix the issue without needing to use fast-load?

Contributor

And one more thing: with the retrieval code, you are encoding the "corpus", which is again text of similar size to the original, using model.encode_corpus. In models where this matters (multilingual-e5, for example), this would give the wrong prompt to the model and alter the results.

Consider adding a binary variable to RetrievalEvaluator's search function that allows the corpus to also be encoded using encode_queries. It should be an easy fix and will make a difference in the results.
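
A minimal sketch of the kind of switch being suggested; the names here (DualEncoder, encode_for_search, corpus_uses_query_prompt) are illustrative assumptions, not the actual RetrievalEvaluator interface:

```python
from typing import Protocol


class DualEncoder(Protocol):
    """Assumed interface: separate encoders for queries and corpus text."""

    def encode_queries(self, sentences: list[str], **kwargs): ...

    def encode_corpus(self, sentences: list[str], **kwargs): ...


def encode_for_search(
    model: DualEncoder,
    queries: list[str],
    corpus: list[str],
    corpus_uses_query_prompt: bool = False,
):
    """Encode both sides of a retrieval task.

    When corpus_uses_query_prompt is True (the case proposed for xsim++),
    the corpus sentences are encoded with encode_queries, so prompt-sensitive
    models such as multilingual-e5 see the same prompt on both sides of the
    symmetric sentence-to-sentence comparison.
    """
    query_embeddings = model.encode_queries(queries)
    if corpus_uses_query_prompt:
        corpus_embeddings = model.encode_queries(corpus)
    else:
        corpus_embeddings = model.encode_corpus(corpus)
    return query_embeddings, corpus_embeddings
```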

year = {2023},
doi = {10.48550/arXiv.2306.12907},
}""",
n_samples={"dev": 997, "devtest": 1012},
Contributor

Are there really only 1012 samples in total?

Any reason to use both the dev and devtest?

jaygala24 (Contributor Author)

The dataset has 1012 samples per language, and there are 200 languages, so in total it amounts to roughly 200k samples.

The dataset has both dev and devtest splits, but we can choose to include only devtest as the eval split, since people usually report results on it.

Contributor

We should remove the dev set (200k samples is already on the higher end). Please add the fast loading introduced in #572 to avoid excessively long loading times.

Contributor

Then feel free to add the points, as well as update the scores using the fast implementation (just to show that the implementation works as intended).
