
Add Belebele Retrieval #636

Merged: 23 commits into embeddings-benchmark:main on May 13, 2024

Conversation

@jupyterjazz (Contributor) commented May 5, 2024

Checklist for adding MMTEB dataset

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 115 distinct languages.

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (or the Python API; see the sketch after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
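
For reference, the model runs in the checklist can be reproduced either with the CLI above (mteb run -m {model_name} -t {task_name}) or via the Python API. Below is a minimal sketch, assuming the new task is registered under the name BelebeleRetrieval and using the standard MTEB(tasks=...).run(...) entry point; it is illustrative rather than the exact commands used in this PR:

```python
from sentence_transformers import SentenceTransformer

from mteb import MTEB

# Illustrative sketch: evaluate the two checklist models on the new task.
# Assumes the task is registered under the name "BelebeleRetrieval".
for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["BelebeleRetrieval"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```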

@imenelydiaker imenelydiaker self-assigned this May 6, 2024
@jupyterjazz jupyterjazz requested a review from imenelydiaker May 6, 2024 13:54
jupyterjazz and others added 4 commits May 7, 2024 22:31
@jupyterjazz jupyterjazz requested a review from imenelydiaker May 7, 2024 20:42
@@ -0,0 +1,2 @@
{"GitHub": "jupyterjazz", "New dataset": 326}
Contributor

It's 2 points for a New dataset even with multiple languages. You may add 4 points per new language if it's not already handled by another task.

Contributor Author

How many points do you suggest I add? 81 of the 115 Belebele languages are not included in other retrieval datasets, which is why I added 2 + 81*4 = 326. I think this is what other PRs did as well.

@Andrian0s (Contributor) commented May 9, 2024

As far as I know, this dataset derives from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.

Contributor

@KennethEnevoldsen we need some help here with the points

Contributor

Ahh, this actually already happened back in #583 and we have had similar cases before (though never on this scale). We might consider capping the new-dataset scores at a maximum of 50-100. Both in terms of value added to the benchmark and effort, I believe this is reasonable. We will have to update previous scores though.

@jupyterjazz how would you feel about such a change? We know that the points system is imperfect (always will be) and we don't want to discourage contributors who have put many hours into review etc. - I hope that the solution can strike the right balance.

Just checking in with @isaac-chung (reviewer, and large contributor) and @davidstap (previous contributor of large dataset) here as well.

Contributor

Thanks for the response @davidstap. The specific reason why I would like to reduce previous contributions is that one dataset has obtained >3000 points for a single dataset contribution. We should have had the discussion at the time, but there are many PRs and I understand that it wasn't caught.

Keeping the score as it is suggests to the contributors that the BibleNLP dataset is worth more than ALL the remaining contributions combined. I don't believe that is fair to the other contributors of this project, so it is clearly something that we will have to change.

Again this should have been caught in review and I am sad that it wasn't. I hope we can find a solution that fits everyone.

Contributor

I get your concern about one dataset contribution racking up a crazy number of points. It seems unfair when that dataset is worth more than all other contributions combined.

OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.

That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.

Contributor

OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.

BibleNLP, FLORES and NTREX can only be used in one task for a multilingual benchmark, though it's worth remembering that the key idea behind MTEB is the ability to evaluate a model across different tasks and languages. These datasets are cross-lingual; most of them are parallel texts built using translation, so they basically contain the same information across all languages.

Having distinct datasets with different languages is also valuable, since we're not only focused on one style of writing or one source of data. It would clearly be unfair to reward a multilingual dataset more than other contributions like adding a new embedding evaluation task.

This new points system will not diminish your contribution: you'll still get 100 points for your two datasets, which will put you in the top list of co-authors, which is already good. It will just make things fair towards other contributors who have also put effort into this.

Again sorry for the confusion, we should have fixed this earlier. But we're humans and we make mistakes 🙂

@KennethEnevoldsen (Contributor) commented May 11, 2024

That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.

Of course, diversity of languages is not the only thing we want to measure; diversity of tasks is also important. For example, a model performing well on Bible datasets might perform well based on heuristics (e.g. whether Jesus appears at the beginning of the sentence), but 50 datasets in 50 different languages cover different domains, tasks and sources and thus, I believe, measure a broader spectrum (when we do a factor analysis of task x language scores I believe we will see the same thing as well).

With all of this being said, I would suggest that we set a limit of 50 points for new datasets but raise the limit to 100 points for already accepted PRs as well as ongoing PRs. I believe this strikes a fair balance between acknowledging that multilingual datasets are extremely valuable and not overstating their value.

Contributor

@imenelydiaker @KennethEnevoldsen thanks for your additional comments. It seems I have overstated the importance of extremely multilingual datasets in the context of MMTEB. I think this stems from my background in multilingual MT, where these kinds of datasets are very valuable. Thanks again for organizing this community effort ☺️

Contributor

Thanks for understanding @davidstap. Again, we very much appreciate your contributions and this is by no means meant to understate them.

@Andrian0s Andrian0s self-requested a review May 9, 2024 22:32
@Andrian0s (Contributor) commented May 9, 2024

I would agree we are ready to merge once we fix my minor comments and agree on points.

@jupyterjazz (Contributor, Author)

@Andrian0s I think your comments were not posted, they're not visible on my side


mteb/tasks/Retrieval/multilingual/BelebeleRetrieval.py: two review comments (outdated, resolved)
@Andrian0s (Contributor)

@Andrian0s I think your comments were not posted, they're not visible on my side

ah, thank you, should be visible now

@jupyterjazz (Contributor, Author)

As far as I know, this dataset derives from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.

I'm sure the languages are covered if you consider all tasks/datasets on MTEB, but from what I understand you get a bonus if the language is new for its task type, in this case retrieval.

@jupyterjazz jupyterjazz requested a review from Andrian0s May 10, 2024 08:13
@KennethEnevoldsen (Contributor)

@jupyterjazz, sorry for taking over this PR with a discussion on points. I believe you can merge this as long as it is acceptable to you that we might reduce the points at a later stage (following the discussion above).

@jupyterjazz (Contributor, Author)

@KennethEnevoldsen, thanks for the review and the discussion. I changed my score to 100 based on this comment #636 (comment)

If everything looks good, please feel free to merge the PR as I'm not allowed to do it myself. Thanks again!

@imenelydiaker (Contributor)

@KennethEnevoldsen, thanks for the review and the discussion. I changed my score to 100 based on this comment #636 (comment)

If everything looks good, please feel free to merge the PR as I'm not allowed to do it myself. Thanks again!

It should be 50 according to this rule: #666
@KennethEnevoldsen @isaac-chung

@KennethEnevoldsen (Contributor)

@imenelydiaker let us keep it at 100 for existing contributions (as they started their contributions before the above discussion).

@imenelydiaker (Contributor) left a comment

Everything looks good, let's merge! Thanks for your contribution!

@imenelydiaker imenelydiaker enabled auto-merge (squash) May 13, 2024 13:02
@imenelydiaker imenelydiaker merged commit 4a2b9db into embeddings-benchmark:main May 13, 2024
7 checks passed