add xsim++ task under retrieval category #609
Sorry, I have opened a new PR to replace the previously closed PR #601, as I accidentally messed up the sync of the forked repo. I was trying to make the current dataset compatible with fast loading, but I am getting the following error. I looked up the issue (huggingface/datasets#5612) in the datasets library, but it seems that issue is not resolved.
Please let me know how we should proceed with this PR. |
Isn't this by definition a cross-lingual task (EN to other languages)? If so, shouldn't it extend CrosslingualTask instead of MultilingualTask? You would then also need to change your language definitions. |
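To make the distinction concrete, here is a minimal sketch of how the language definitions might differ, assuming a multilingual task lists each language as its own subset while a cross-lingual task keys its subsets by language pair. The helper names `build_multilingual_langs` and `build_crosslingual_langs` and the language codes are hypothetical; they are not taken from this PR or from the mteb API.

```python
from typing import Dict, List


def build_multilingual_langs(langs: List[str]) -> Dict[str, List[str]]:
    """One subset per language: each subset is evaluated on its own."""
    return {lang: [lang] for lang in langs}


def build_crosslingual_langs(source: str, targets: List[str]) -> Dict[str, List[str]]:
    """One subset per (source, target) pair, e.g. English paired with each other language."""
    return {f"{source}-{tgt}": [source, tgt] for tgt in targets if tgt != source}


# Hypothetical language codes, chosen only for illustration.
multi = build_multilingual_langs(["eng", "fra", "hin"])
cross = build_crosslingual_langs("eng", ["fra", "hin"])
print(multi)
print(cross)
```

In the pair-keyed form every subset names both sides of the retrieval, which is why switching base classes would also mean rewriting the language definitions.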
I added some comments in the previous PR #601, as I am implementing something similar now and have been working on the same issues. |
I would use the cross-lingual task in this case. Regarding fast loading: since you overwrite the load_data method, I don't believe it would actually do anything. However, I have now read the comments from @Andrian0s (in the original PR). These seem to question whether the current task formulation is relevant, and I believe it is worth having that discussion first before moving on. @jaygala24, will you address the comments (let us keep it in this thread)? It might be worth introducing a variant task (which seems to be required due to the encode_corpus/encode_query). |
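The point about fast loading can be illustrated with a toy sketch, assuming the fast-loading switch is only read inside the base class's load_data. The classes `BaseTask`, `CustomLoadTask`, and `DefaultLoadTask` and the `fast_loading` attribute are hypothetical stand-ins, not the real mteb classes.

```python
class BaseTask:
    """Stand-in for a task base class; not the real mteb class."""

    fast_loading = False

    def __init__(self):
        self.calls = []

    def load_data(self):
        # The base class is the only place that checks the flag.
        if self.fast_loading:
            self.calls.append("fast_load")
        else:
            self.calls.append("slow_load")


class CustomLoadTask(BaseTask):
    fast_loading = True  # set, but never consulted below

    def load_data(self):
        # Overrides the base method without calling super(),
        # so the flag check in BaseTask.load_data is skipped.
        self.calls.append("custom_load")


class DefaultLoadTask(BaseTask):
    fast_loading = True  # consulted, because load_data is inherited


custom = CustomLoadTask()
custom.load_data()
default = DefaultLoadTask()
default.load_data()
print(custom.calls, default.calls)
```

Under that assumption, a task with its own load_data would need to implement any fast path itself, or call the base method, for the flag to have an effect.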
PR #560 will also allow Retrieval tasks to handle CrossLingual tasks.
@jaygala24 seems like this PR has gone stale. Will close it for now but feel free to re-open it if you want to finish it up |
@KennethEnevoldsen Sorry, I was away for personal reasons, so I couldn't follow up on the discussion here. I'll review the entire discussion and then either re-open this PR or make a new PR. |
Wonderful @jaygala24 glad to have you back! |
Checklist for adding MMTEB dataset
Reason for dataset addition:

mteb package
mteb run -m {model_name} -t {task_name} command
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test
make lint
438.jsonl