
Add BibleNLP dataset #583

Merged: 9 commits into embeddings-benchmark:main, Apr 28, 2024
Conversation

@davidstap (Contributor) commented Apr 26, 2024

Checklist for adding MMTEB dataset

This dataset has extremely large coverage of 829 languages, including numerous low-resource languages. For this reason I only considered English-centric directions, resulting in a manageable 1656 pairs. I will add performance results later, but initial testing suggests that while quality is poor for most (low-resource) languages, it is far better than random guessing, so IMO it is a good addition to the benchmark.
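As a sanity check on the numbers above (my reading, not stated explicitly in the PR): with 829 languages, one of which is English, restricting to English-centric directions pairs each of the other 828 languages with English in both directions, which yields the 1656 pairs mentioned.

```python
# Assumption: "English-centric directions" means xx->eng and eng->xx
# for every non-English language in the 829-language set.
total_languages = 829
english_centric_pairs = (total_languages - 1) * 2
print(english_centric_pairs)  # 1656, matching the pair count above
```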

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
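The checklist's subsampling item can be illustrated with a rough pure-Python sketch of the idea behind stratified subsampling: shrink the dataset while keeping per-label proportions intact. The function name and signature below are mine for illustration, not mteb's actual `self.stratified_subsampling()` API.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, label_key, n_samples, seed=42):
    """Down-sample `examples` to roughly n_samples items while keeping
    the per-label proportions of the original data (illustrative sketch)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    total = len(examples)
    subsampled = []
    for label, group in by_label.items():
        # Each label keeps a share proportional to its original frequency.
        k = max(1, round(n_samples * len(group) / total))
        subsampled.extend(rng.sample(group, min(k, len(group))))
    return subsampled

# Example: 80/20 label split is preserved after subsampling to 10 items.
data = [{"label": "a"} for _ in range(80)] + [{"label": "b"} for _ in range(20)]
sub = stratified_subsample(data, "label", 10)
print(len(sub))  # 10
```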

@davidstap (Contributor, Author) commented Apr 26, 2024

Could you review this PR @KennethEnevoldsen @imenelydiaker? I'm personally quite excited by the large language coverage.

@isaac-chung (Collaborator) left a comment

This is awesome! Just a few comments.

@dokato (Collaborator) left a comment

Really interesting dataset :) Find some of my suggestions below.

@davidstap (Contributor, Author) commented

I have addressed your comments @isaac-chung @dokato, thanks for your suggestions!

Results are included now as well. intfloat/multilingual-e5-small performs best, but as I pointed out, performance is poor for most low-resource languages. However, almost all directions score far better than random chance (1/256), so this seems like a valuable addition for evaluating (future) multilingual models.

I think this is ready to merge now, do you agree? Let me know, then I'll calculate the points. Thanks!

@davidstap davidstap marked this pull request as ready for review April 28, 2024 10:55
@isaac-chung (Collaborator) left a comment

Thanks for iterating! I think it's good to merge. For points, the dataset is worth 2 + 4 × (number of new languages added to the BitextMining task). Please check whether any of the languages are already covered in that folder, as well as in the multilingual subfolder. Then 2 pts each to @dokato and me for the reviews.

@davidstap (Contributor, Author) commented Apr 28, 2024

Thanks for your quick response! I calculated the number of new languages using this script: 756 new languages (out of 829 languages used in this task).
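Applying the points formula from the review comment to the figure above (assuming all 756 counted languages are indeed new to the BitextMining folder):

```python
# Formula from the review: 2 (base) + 4 * number of new languages.
base_points = 2
new_languages = 756  # count reported in this thread
points = base_points + 4 * new_languages
print(points)  # 3026
```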

I'll update the points and merge.

@davidstap (Contributor, Author) commented

Ready to merge now @isaac-chung (I don't have write access)

@isaac-chung isaac-chung merged commit 593fc8f into embeddings-benchmark:main Apr 28, 2024
7 checks passed
@KennethEnevoldsen KennethEnevoldsen mentioned this pull request May 10, 2024
3 participants