
add xsim++ task under retrieval category #609

Closed

Conversation

jaygala24
Contributor

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform() (see the sketch after this list).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
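
For the subsampling item, a minimal sketch of how the hook is typically wired up inside the task class; the exact stratified_subsampling arguments are assumptions and may differ between mteb versions (and may not map directly onto retrieval-style data):

```python
from mteb.abstasks import MultilingualTask

class XSimPlusPlusSketch(MultilingualTask):
    # metadata omitted; this stub only illustrates the subsampling hook
    def dataset_transform(self):
        # Subsample each evaluation split to keep the task affordable; the
        # argument names are assumed and may differ between mteb versions.
        self.dataset = self.stratified_subsampling(
            self.dataset,
            seed=self.seed,
            splits=["test"],
        )
```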

@jaygala24
Contributor Author

@KennethEnevoldsen

Sorry, I have opened this new PR to replace the previously closed PR #601, as I accidentally messed up the sync of my forked repo.

I was trying to make the current dataset compatible with fast loading, but I am getting the following error. I looked up the related issue (huggingface/datasets#5612) in the datasets library, but it does not appear to be resolved.

```
raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
ValueError: Arrow type large_list<item: large_string> does not have a datasets dtype equivalent.
```
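
One possible workaround (untested here, and the path below is hypothetical) is to round-trip the parquet file through pandas so that datasets re-infers standard list features instead of the large_list Arrow type:

```python
import pyarrow.parquet as pq
from datasets import Dataset

# Hypothetical local path to one of the dataset's parquet files.
table = pq.read_table("data/xsimplusplus_test.parquet")

# Round-tripping through pandas lets `datasets` re-infer plain list<string>
# features instead of failing on Arrow's large_list<large_string>.
ds = Dataset.from_pandas(table.to_pandas())
print(ds.features)
```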

Please let me know how we should proceed with this PR.

@Andrian0s
Contributor

Isn't this by definition a cross-lingual task (English paired with other languages), and therefore shouldn't this extend CrosslingualTask instead of MultilingualTask? You would then also need to change your language definitions.
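
For illustration, a minimal sketch of what that change to the language definitions could look like; the pair names and codes below are made up, not taken from this PR:

```python
# MultilingualTask-style eval_langs: one subset per language.
_MULTILINGUAL_EVAL_LANGS = {
    "deu": ["deu-Latn"],
    "fra": ["fra-Latn"],
}

# CrosslingualTask-style eval_langs: one subset per language *pair*,
# here each non-English language paired with English.
_CROSSLINGUAL_EVAL_LANGS = {
    "deu-eng": ["deu-Latn", "eng-Latn"],
    "fra-eng": ["fra-Latn", "eng-Latn"],
}
```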

@Andrian0s
Contributor

Added some comments in the previous PR #601, as I am implementing something similar and have been working through the same issues.

@KennethEnevoldsen
Contributor

Isn't this by definition a cross-lingual task (English paired with other languages), and therefore shouldn't this extend CrosslingualTask instead of MultilingualTask? You would then also need to change your language definitions.

I would use the cross-lingual task in this case.

Regarding fast loading: since you overwrite the load_data method, I don't believe it would actually do anything.

For now, however, I have examined the comments from @Andrian0s (in the original PR). These raise the question of whether the current task formulation is the right one, and I believe it is worth settling that discussion before moving on. @jaygala24, will you address the comments (let us keep the discussion in this thread)? It might be worth introducing a variant task, which seems to be required due to the encode_corpus/encode_query split.
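
For context, a rough, hypothetical sketch of the encode split being referred to: retrieval evaluation encodes queries and corpus documents through separate calls, which is why a variant may be needed. The wrapper below is illustrative only, not mteb's actual implementation:

```python
class RetrievalWrapperSketch:
    """Hypothetical wrapper showing the encode_queries/encode_corpus split."""

    def __init__(self, model):
        self.model = model  # any SentenceTransformer-like object with .encode()

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Queries may need model-specific prefixes (e.g. "query: " for E5 models).
        return self.model.encode(queries, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Corpus entries are typically dicts with "title" and "text" fields.
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.model.encode(texts, batch_size=batch_size, **kwargs)
```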

@KennethEnevoldsen self-assigned this May 1, 2024
@KennethEnevoldsen marked this pull request as draft May 1, 2024 12:12
@imenelydiaker
Contributor

PR #560 will allow Retrieval tasks to also handle cross-lingual tasks.

@KennethEnevoldsen
Contributor

@jaygala24 it seems like this PR has gone stale. I will close it for now, but feel free to re-open it if you want to finish it up.

@jaygala24
Contributor Author

@KennethEnevoldsen Sorry, I was away for personal reasons, so I couldn't follow up on the discussion here. I'll review the entire discussion and then either re-open this PR or open a new one.

@KennethEnevoldsen
Contributor

Wonderful, @jaygala24, glad to have you back!
