add xsimplusplus in retrieval category #601
Conversation
Adding the xsim++ dataset, spanning over 200 languages, as a retrieval-category task.
A question related to the task formulation
"path": "jaygala24/xsimplusplus", | ||
"revision": "07f92f877ea651659f3815884761c49191a1c80c", | ||
}, | ||
description="xsim++ is dataset created to capture more subtle improvements in bitext mining by adding challenging negative examples.", |
Bitext mining? What is the reason we formulate it as a retrieval task then?
The default formulation for bitext mining expects a one-to-one mapping between source and target translations, where every example other than the one at the corresponding index acts as a negative sample. In xsim++, however, each English sentence has a variable number of negative samples depending on the perturbation choices, and the perturbed English sentences have no corresponding positive pair on the target-language side.
The task formulation that we think best fits the xsim++ dataset in MTEB is retrieval, where each query (a target-language sentence) can have a variable number of positive (the ground-truth English sentence) and negative pairs (the perturbed English sentences). This is similar to the following task currently in the retrieval category.
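To make the mapping concrete, here is a rough sketch of how the retrieval formulation could be laid out with query/corpus/qrels dictionaries. The ids and sentences are made up for illustration and are not the dataset's actual schema:

```python
# Illustrative only: made-up ids and sentences, not the real xsim++ schema.
# Each target-language sentence acts as a query; the corpus holds the single
# ground-truth English sentence plus its perturbed variants, and the qrels
# mark only the ground truth as relevant.
queries = {"q0": "Das ist ein Beispielsatz."}
corpus = {
    "d0": {"text": "This is an example sentence."},   # ground-truth English (positive)
    "d1": {"text": "This is an example sentences."},  # perturbed English (negative)
    "d2": {"text": "That was an example sentence."},  # perturbed English (negative)
}
relevant_docs = {"q0": {"d0": 1}}  # each query can have any number of negatives
```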
Thanks @jaygala24, my suspicion was that it was due to the negative examples as well. Do note that it is possible to introduce new task types to MTEB; however, let us keep it as a retrieval task for now.
Ah, now I have followed this conversation through. xsim++ was developed to evaluate bitext mining systems cross-lingually.
Implementation-wise, reusing the retrieval evaluation code makes sense. However, the retrieval metrics are not really relevant to the task (the paper never evaluates them; it only reports pairwise model accuracy). I would suggest finding a case where this dataset was used beyond pairwise comparison and reporting those same metrics.
Conceptually, it is an adversarial evaluation, similar to what the Cross-Lingual Semantic Discrimination task (PR coming) will be. When I am through with that addition, it might make sense to have xsim++ as a tab within that section.
Moving this also here for visibility:
Isn't this by definition a cross-lingual task (EN-other languages), and therefore shouldn't it extend CrosslingualTask instead of MultilingualTask? Could this fix the issue without being able to use fast-load?
One more thing: with the retrieval code, you are encoding the "corpus" (which is text of similar size to the original sentences) using model.encode_corpus. For models where this matters (e5-multilingual, for example), this would give the wrong prompt to the model, altering the results.
Consider adding a binary variable to the search function in RetrievalEvaluator that allows the corpus to also be encoded with encode_queries. It should be an easy fix and will make a difference in the results.
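A minimal sketch of what such a flag could look like, assuming a model that exposes encode_queries/encode_corpus; the argument name and the signature below are illustrative, not the actual MTEB API:

```python
import numpy as np

def search(model, corpus: dict, queries: dict, top_k: int,
           encode_corpus_as_queries: bool = False) -> dict:
    """Return {query_id: {doc_id: score}} for the top_k most similar documents."""
    qids, cids = list(queries), list(corpus)
    q_emb = np.asarray(model.encode_queries([queries[q] for q in qids]))
    texts = [corpus[c]["text"] for c in cids]
    if encode_corpus_as_queries:
        # xsim++ "documents" are ordinary sentences, so give them the query prompt.
        c_emb = np.asarray(model.encode_queries(texts))
    else:
        c_emb = np.asarray(model.encode_corpus(texts))
    # Cosine similarity between every query and every document.
    q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c_emb = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    scores = q_emb @ c_emb.T
    results = {}
    for i, qid in enumerate(qids):
        top = np.argsort(-scores[i])[:top_k]
        results[qid] = {cids[j]: float(scores[i, j]) for j in top}
    return results
```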
year = {2023},
doi = {10.48550/arXiv.2306.12907},
}""",
n_samples={"dev": 997, "devtest": 1012},
Are there really only 1012 samples in total?
Any reason to use both the dev and devtest splits?
The dataset has 1012 samples per language, and there are 200 languages, so in total it amounts to about 200k samples.
The dataset has both dev and devtest splits, but we can choose to include only devtest as the eval split, since people usually report results on it.
We should remove the dev set (200k is already on the higher end). Please add the fast loading introduced in #572 to avoid excessively long loading times.
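For reference, one way the fast-loading idea could look, as a sketch under assumptions (a single Hugging Face config holding all languages and a "lang" column; the actual implementation in #572 may differ):

```python
from datasets import load_dataset

def fast_load(path: str, revision: str, langs: list[str], split: str = "devtest") -> dict:
    """Load the whole split once and slice it per language locally,
    instead of issuing one load_dataset() call per language config."""
    full = load_dataset(path, revision=revision, split=split)
    return {
        lang: full.filter(lambda ex, l=lang: ex["lang"] == l)
        for lang in langs
    }
```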
Then feel free to add the points as well as update the scores using the fast implementation (just to show that the implementation works as intended).
Checklist for adding MMTEB dataset
Reason for dataset addition:
- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- If the dataset is too large, consider using self.stratified_subsampling() under dataset_transform().
- Run tests locally to make sure nothing is broken using make test.
- Run the formatter to format the code using make lint.
- I have added points for my submission to the points folder, using the PR number as the filename (e.g. 438.jsonl).