
Add Belebele Retrieval #636

Merged: 23 commits into embeddings-benchmark:main on May 13, 2024

Conversation

@jupyterjazz (Contributor) commented May 5, 2024

Checklist for adding MMTEB dataset

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 115 distinct languages.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (or the Python API; see the sketch after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
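
For reference, the model runs in the checklist can be reproduced either with the CLI above (mteb run -m {model_name} -t {task_name}) or via the Python API. Below is a minimal sketch, assuming the new task is registered under the name BelebeleRetrieval and using the standard MTEB(tasks=...).run(...) entry point; it is illustrative rather than the exact commands used in this PR:

```python
from sentence_transformers import SentenceTransformer

from mteb import MTEB

# Illustrative sketch: evaluate the two checklist models on the new task.
# Assumes the task is registered under the name "BelebeleRetrieval".
for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["BelebeleRetrieval"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```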

@imenelydiaker imenelydiaker self-assigned this May 6, 2024
@jupyterjazz jupyterjazz requested a review from imenelydiaker May 6, 2024 13:54
jupyterjazz and others added 4 commits May 7, 2024 22:31
@jupyterjazz jupyterjazz requested a review from imenelydiaker May 7, 2024 20:42
@@ -0,0 +1,2 @@
{"GitHub": "jupyterjazz", "New dataset": 326}
Contributor

It's 2 points for a New dataset even with multiple languages. You may add 4 points per new language if it's not already handled by another task.

Contributor Author

How many points do you suggest I add? 81 of the 115 Belebele languages are not included in other retrieval datasets, which is why I added 2 + 81*4 = 326. I think this is what other PRs did as well.

@Andrian0s (Contributor) commented May 9, 2024

As far as I know, this dataset derives from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.

Contributor

@KennethEnevoldsen we need some help here with the points

Contributor

Ahh, this actually already happened back in #583 and we have had similar cases before (though never on this scale). We might consider capping the new-dataset scores at a maximum of 50-100. Both in terms of value added to the benchmark and effort, I believe this is reasonable. We will have to update previous scores though.

@jupyterjazz how would you feel about such a change? We know that the points system is imperfect (always will be) and we don't want to discourage contributors who have put many hours into review etc. - I hope that the solution can strike the right balance.

Just checking in with @isaac-chung (reviewer, and large contributor) and @davidstap (previous contributor of large dataset) here as well.

Contributor

Thanks for the response @davidstap. The specific reason why I would like to reduce previous contributions is that one dataset has obtained >3000 points for a single dataset contribution. We should have had the discussion at the time, but there are many PRs and I understand that it wasn't caught.

Keeping the score as it is suggests to the contributors that the BibleNLP dataset is worth more than ALL the remaining contributions combined. I don't believe that is fair to the other contributors of this project, so it is clearly something that we will have to change.

Again this should have been caught in review and I am sad that it wasn't. I hope we can find a solution that fits everyone.

Contributor

I get your concern about one dataset contribution racking up a crazy number of points. It seems unfair when that dataset is worth more than all other contributions combined.

OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.

That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.

Contributor

OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.

BibleNLP, FLORES and NTREX can only be used in one task for a multilingual benchmark, though it's worth remembering that the key idea behind MTEB is the ability to evaluate a model across different tasks and languages. These datasets are cross-lingual; most of them are parallel texts built using translation, so they basically contain the same information across all languages.

Having distinct datasets with different languages is also valuable, since we're not only focused on one style of writing or one source of data. It would clearly be unfair to reward a multilingual dataset more than other contributions like adding a new embedding evaluation task.

This new points system will not diminish your contribution: you'll still get 100 points for your two datasets, which will put you in the top list of co-authors, which is already good. It will just make things fair towards other contributors who have also put effort into this.

Again sorry for the confusion, we should have fixed this earlier. But we're humans and we make mistakes 🙂

@KennethEnevoldsen (Contributor) commented May 11, 2024

That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.

Of course, diversity of languages is not the only thing we want to measure; diversity of tasks is also important. For example, a model performing well on Bible datasets might perform well based on heuristics (e.g. whether Jesus appears at the beginning of the sentence), but 50 datasets in 50 different languages cover different domains, tasks and sources and thus, I believe, measure a broader spectrum (when we do a factor analysis of task x language scores I believe we will see the same thing as well).

With all of this being said, I would suggest that we set a limit of 50 points for new datasets but raise the limit to 100 points for already accepted PRs as well as ongoing PRs. I believe this strikes a fair balance between acknowledging that multilingual datasets are extremely valuable and not overstating their value.

Contributor

@imenelydiaker @KennethEnevoldsen thanks for your additional comments. It seems I have overstated the importance of extremely multilingual datasets in the context of MMTEB. I think this stems from my background in multilingual MT, where these kinds of datasets are very valuable. Thanks again for organizing this community effort ☺️

Contributor

Thanks for understanding @davidstap. Again, we very much appreciate your contributions and this is by no means meant to understate them.

@Andrian0s Andrian0s self-requested a review May 9, 2024 22:32
@Andrian0s (Contributor) commented May 9, 2024

I would agree we are ready to merge once we fix my minor comments and agree on points.

@jupyterjazz (Contributor, Author)

@Andrian0s I think your comments were not posted, they're not visible on my side


mteb/tasks/Retrieval/multilingual/BelebeleRetrieval.py: two review comments (outdated, resolved)
@Andrian0s (Contributor)

@Andrian0s I think your comments were not posted, they're not visible on my side

ah, thank you, should be visible now

@jupyterjazz (Contributor, Author)

As far as I know, this dataset derives from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.

I'm sure the languages are covered if you consider all tasks/datasets on MTEB, but from what I understand you get a bonus if the language is new for its task type, in this case retrieval.

@jupyterjazz jupyterjazz requested a review from Andrian0s May 10, 2024 08:13
@KennethEnevoldsen (Contributor)

@jupyterjazz, sorry for taking over this PR with a discussion on points. I believe you can merge this as long as it is acceptable to you that we might reduce the points at a later stage (following the discussion above).

@jupyterjazz (Contributor, Author)

@KennethEnevoldsen, thanks for the review and the discussion. I changed my score to 100 based on this comment #636 (comment)

If everything looks good, please feel free to merge the PR as I'm not allowed to do it myself. Thanks again!

@imenelydiaker (Contributor)

@KennethEnevoldsen, thanks for the review and the discussion. I changed my score to 100 based on this comment #636 (comment)

If everything looks good, please feel free to merge the PR as I'm not allowed to do it myself. Thanks again!

It should be 50 according to this rule: #666
@KennethEnevoldsen @isaac-chung

@KennethEnevoldsen (Contributor)

@imenelydiaker let us keep it at 100 for existing contributions (as they started their contributions before the above discussion).

@imenelydiaker (Contributor) left a comment

Everything looks good, let's merge! Thanks for your contribution!

@imenelydiaker imenelydiaker enabled auto-merge (squash) May 13, 2024 13:02
@imenelydiaker imenelydiaker merged commit 4a2b9db into embeddings-benchmark:main May 13, 2024
7 checks passed