Add BibleNLP dataset #583
Could you review this PR @KennethEnevoldsen @imenelydiaker? I'm personally quite excited by the large language coverage.
This is awesome! Just a few comments.
Really interesting dataset :) find some of my suggestions below
I have addressed your comments @isaac-chung @dokato, thanks for your suggestions! Results are included now as well. I think this is ready to merge now, do you agree? Let me know, then I'll calculate the points. Thanks!
Thanks for iterating! I think it's good enough to merge. In terms of points, the dataset is worth 2 + 4 * (number of new languages added to the BitextMining task). Please check whether any of the languages have already been covered in that folder and in the multilingual subfolder as well. Then 2 pts each to @dokato and me for the reviews.
Thanks for your quick response! I calculated the number of new languages using this script: 756 new languages (out of the 829 languages used in this task). I'll update the points and merge.
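For reference, the points arithmetic discussed above can be sketched as follows. This is only an illustration and not the script mentioned in the comment: the function names and the toy language codes are hypothetical, and only the formula (2 + 4 per new language) and the count of 756 come from this thread.

```python
def new_languages(task_langs, covered_langs):
    """Languages used by the new task that no existing task already covers."""
    return set(task_langs) - set(covered_langs)


def dataset_points(n_new_languages, base=2, per_language=4):
    """Points for a dataset: base + per_language * number of new languages."""
    return base + per_language * n_new_languages


# Toy example with made-up language codes (hypothetical data).
toy_new = new_languages({"eng", "dan", "aai"}, {"eng", "dan"})
print(sorted(toy_new))  # ['aai']

# Using the count reported in this thread.
print(dataset_points(756))  # 3026
```

The 2 review points per reviewer are counted separately and are not part of this dataset total.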
Ready to merge now @isaac-chung (I don't have write access)
Checklist for adding MMTEB dataset
This dataset has extremely broad coverage (829 languages) and includes numerous low-resource languages. For this reason, I only considered English-centric directions, resulting in a manageable 1656 pairs. I will add performance results later, but initial testing suggests that while quality is poor for most (low-resource) languages, it is far better than random guessing, so IMO it is a good addition to the benchmark.
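The pair count is consistent with pairing every non-English language with English in both directions: (829 - 1) * 2 = 1656. A minimal sketch of that enumeration, assuming English is one of the 829 languages; the function name, the `eng` pivot code, the `src-tgt` naming scheme and the placeholder language codes are illustrative assumptions, not taken from the PR's code:

```python
def english_centric_pairs(langs, pivot="eng"):
    """Return both translation directions between the pivot and every other language."""
    others = sorted(lang for lang in set(langs) if lang != pivot)
    pairs = []
    for lang in others:
        pairs.append(f"{pivot}-{lang}")  # pivot -> other language
        pairs.append(f"{lang}-{pivot}")  # other language -> pivot
    return pairs


# Placeholder codes standing in for the 828 non-English languages (hypothetical).
langs = ["eng"] + [f"l{i:03d}" for i in range(828)]
pairs = english_centric_pairs(langs)
print(len(pairs))  # 1656
```

Enumerating all ordered pairs instead would give 829 * 828 = 686,412 directions, which is why restricting to English-centric pairs keeps the task tractable.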
- Run with the mteb package.
- Models run with the mteb run -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- Subsampling, if needed: self.stratified_subsampling() under dataset_transform().
- Tests: make test.
- Formatting: make lint.
- Points file (e.g. 438.jsonl).