
add xsim++ task under retrieval category #609

Closed

Conversation

jaygala24
Contributor

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform() (see the sketch after this list).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
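
For the subsampling item, a minimal sketch of how the hook is typically wired up inside the task class; the exact stratified_subsampling arguments are assumptions and may differ between mteb versions (and may not map directly onto retrieval-style data):

```python
from mteb.abstasks import MultilingualTask

class XSimPlusPlusSketch(MultilingualTask):
    # metadata omitted; this stub only illustrates the subsampling hook
    def dataset_transform(self):
        # Subsample each evaluation split to keep the task affordable; the
        # argument names are assumed and may differ between mteb versions.
        self.dataset = self.stratified_subsampling(
            self.dataset,
            seed=self.seed,
            splits=["test"],
        )
```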

@jaygala24
Contributor Author

@KennethEnevoldsen

Sorry, I have opened this new PR to replace the previously closed PR #601, as I accidentally messed up the sync of my forked repo.

I was trying to make the current dataset compatible with fast loading, but I am getting the following error. I looked up the related issue (huggingface/datasets#5612) in the datasets library, but it does not appear to be resolved.

```
raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
ValueError: Arrow type large_list<item: large_string> does not have a datasets dtype equivalent.
```
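
One possible workaround (untested here, and the path below is hypothetical) is to round-trip the parquet file through pandas so that datasets re-infers standard list features instead of the large_list Arrow type:

```python
import pyarrow.parquet as pq
from datasets import Dataset

# Hypothetical local path to one of the dataset's parquet files.
table = pq.read_table("data/xsimplusplus_test.parquet")

# Round-tripping through pandas lets `datasets` re-infer plain list<string>
# features instead of failing on Arrow's large_list<large_string>.
ds = Dataset.from_pandas(table.to_pandas())
print(ds.features)
```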

Please let me know how we should proceed with this PR.

@Andrian0s
Contributor

Isn't this by definition a cross-lingual task (English paired with other languages), and therefore shouldn't this extend CrosslingualTask instead of MultilingualTask? You would then also need to change your language definitions.
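
For illustration, a minimal sketch of what that change to the language definitions could look like; the pair names and codes below are made up, not taken from this PR:

```python
# MultilingualTask-style eval_langs: one subset per language.
_MULTILINGUAL_EVAL_LANGS = {
    "deu": ["deu-Latn"],
    "fra": ["fra-Latn"],
}

# CrosslingualTask-style eval_langs: one subset per language *pair*,
# here each non-English language paired with English.
_CROSSLINGUAL_EVAL_LANGS = {
    "deu-eng": ["deu-Latn", "eng-Latn"],
    "fra-eng": ["fra-Latn", "eng-Latn"],
}
```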

@Andrian0s
Contributor

Added some comments in the previous PR #601, as I am implementing something similar and have been working through the same issues.

@KennethEnevoldsen
Contributor

Isn't this by definition a cross-lingual task (English paired with other languages), and therefore shouldn't this extend CrosslingualTask instead of MultilingualTask? You would then also need to change your language definitions.

I would use the cross-lingual task in this case.

Regarding fast loading: since you overwrite the load_data method, I don't believe it would actually do anything.

For now, however, I have examined the comments from @Andrian0s (in the original PR). These raise the question of whether the current task formulation is the right one, and I believe it is worth settling that discussion before moving on. @jaygala24, will you address the comments (let us keep the discussion in this thread)? It might be worth introducing a variant task, which seems to be required due to the encode_corpus/encode_query split.
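
For context, a rough, hypothetical sketch of the encode split being referred to: retrieval evaluation encodes queries and corpus documents through separate calls, which is why a variant may be needed. The wrapper below is illustrative only, not mteb's actual implementation:

```python
class RetrievalWrapperSketch:
    """Hypothetical wrapper showing the encode_queries/encode_corpus split."""

    def __init__(self, model):
        self.model = model  # any SentenceTransformer-like object with .encode()

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Queries may need model-specific prefixes (e.g. "query: " for E5 models).
        return self.model.encode(queries, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Corpus entries are typically dicts with "title" and "text" fields.
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.model.encode(texts, batch_size=batch_size, **kwargs)
```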

@KennethEnevoldsen self-assigned this May 1, 2024
@KennethEnevoldsen marked this pull request as draft May 1, 2024 12:12
@imenelydiaker
Contributor

PR #560 will allow Retrieval tasks to also handle cross-lingual tasks.

@KennethEnevoldsen
Contributor

@jaygala24 it seems like this PR has gone stale. I will close it for now, but feel free to re-open it if you want to finish it up.

@jaygala24
Contributor Author

@KennethEnevoldsen Sorry, I was away for personal reasons, so I couldn't follow up on the discussion here. I'll review the entire discussion and then either re-open this PR or open a new one.

@KennethEnevoldsen
Contributor

Wonderful, @jaygala24, glad to have you back!
