Add Belebele Retrieval #636
Conversation
Signed-off-by: jupyterjazz <[email protected]>
docs/mmteb/points/636.jsonl
@@ -0,0 +1,2 @@
{"GitHub": "jupyterjazz", "New dataset": 326}
It's 2 points for a New dataset even with multiple languages. You may add 4 points per new language if it's not already handled by another task.
How many points do you suggest I add? 81 of the 115 Belebele languages are not included in other retrieval datasets, which is why I added 2 + 81*4. I think this is what other PRs did as well.
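For reference, a quick sketch of that calculation (the 2-point base for a new dataset and the 4 points per new language are the figures from this thread; the variable names are just illustrative):

```python
# Points calculation described above, using the figures from this thread.
base_points = 2        # base points for a new dataset
new_languages = 81     # Belebele languages not yet covered by other retrieval datasets
total = base_points + new_languages * 4
print(total)           # 326, matching docs/mmteb/points/636.jsonl
```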
As far as I know, this dataset stems from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.
@KennethEnevoldsen we need some help here with the points
Ahh, this actually already happened back in #583 and we have had similar cases before (though never on this scale). We might consider changing the new dataset scores to be a max of 50-100. Both in terms of value added to the benchmark and effort, I believe this is reasonable. We will have to update previous scores though.
@jupyterjazz how would you feel about such a change? We know that the points system is imperfect (always will be) and we don't want to discourage contributors who have put many hours into review etc. - I hope that the solution can strike the right balance.
Just checking in with @isaac-chung (reviewer, and large contributor) and @davidstap (previous contributor of large dataset) here as well.
Thanks for the response @davidstap. The specific reason why I would like to reduce previous contributions is that one dataset has obtained >3000 points for a single dataset contribution. We should have taken the discussion at the time, but there are many PRs and I understand that it wasn't caught.
Keeping the score as it is suggests to the contributors that the BibleNLP dataset is worth more than ALL the remaining contributions combined. I don't believe that is fair to the other contributors of this project, so it is clearly something that we will have to change.
Again this should have been caught in review and I am sad that it wasn't. I hope we can find a solution that fits everyone.
I get your concern about one dataset contribution racking up a crazy number of points. It seems unfair when that dataset is worth more than all other contributions combined.
OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.
That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.
> OTOH, I think massively multilingual datasets such as BibleNLP, but also FLORES and NTREX, are extremely valuable for a multilingual benchmark. They cover a ton of languages while keeping the domain and other confounders constant, which is crucial when one wants to test how well different languages perform in the same setting. When you're dealing with languages from different datasets, it's hard to compare performance because there are so many other confounding factors at play, like the topic/domain or length of the samples.
BibleNLP, FLORES and NTREX can only be used in one task for a multilingual benchmark, although it's worth remembering that the key idea behind MTEB is the ability to evaluate a model across different tasks and languages. These datasets are cross-lingual; most of them are parallel texts built using translation, so they basically contain the same information across all languages.
Having distinct datasets with different languages is also valuable since we're not only focused on one style of writing or one source of data. It would clearly be unfair to reward a multilingual dataset more than other contributions like adding a new embedding evaluation task.
This new point system will not diminish your contribution: you'll still get 100 points for your two datasets, which will put you in the top list of co-authors, which is already good. It will just make it fair towards other contributors who have also put effort into this.
Again sorry for the confusion, we should have fixed this earlier. But we're humans and we make mistakes 🙂
> That’s why I think it’s much more valuable to have a single dataset with 500 languages, instead of 10 datasets with 50 distinct languages each, even if the set of languages is the same. The new point system (cap at 50/100 points) would not capture this at all. Of course no point system is perfect, but IMO this change would not be an improvement.
Of course, diversity of languages is not the only target of measurement; diversity of tasks is also important. E.g. a model performing well on Bible datasets might perform well based on heuristics (e.g. whether Jesus appears at the beginning of the sentence), but 50 datasets in 50 different languages cover different domains, tasks and sources, and thus I believe measure a broader spectrum (when we do a factor analysis of task x language scores, I believe we will see the same thing).
With all of this being said, I would suggest that we set a limit of 50 points for new datasets but raise the limit to 100 points for already accepted PRs as well as ongoing PRs. I believe this strikes a fair balance: it states that multilingual datasets are extremely valuable while not overstating their value.
@imenelydiaker @KennethEnevoldsen thanks for your additional comments. It seems I have overstated the importance of extremely multilingual datasets in the context of MMTEB. I think this stems from my background in multilingual MT, where these kinds of datasets are very valuable. Thanks again for organizing this community effort.
Thanks for understanding @davidstap. Again we very much appreciate your contributions and this is by no means meant to understate it.
I would agree we are ready to merge when we fix my minor comments and agree on points.
@Andrian0s I think your comments were not posted, they're not visible on my side.
docs/mmteb/points/636.jsonl
{"GitHub": "jupyterjazz", "New dataset": 326}
> As far as I know, this dataset stems from FLORES, which is already covered by datasets like SIB. Not a retrieval dataset, but it still covers the languages.
ah, thank you, should be visible now
I'm sure the languages are covered if you consider all tasks/datasets on mteb, but from what I understand you get a bonus if the language is new for its task type, in this case for retrieval.
@jupyterjazz, sorry for taking over this PR with a discussion on points. I believe you can merge this as long as it is acceptable to you that we might reduce the points at a later stage (following the discussion above).
@KennethEnevoldsen, thanks for the review and the discussion. I changed my score to 100 based on this comment: #636 (comment). If everything looks good, please feel free to merge the PR as I'm not allowed to do it myself. Thanks again!
It should be 50 according to this rule: #666
@imenelydiaker let us keep it at 100 for existing contributions (as they started their contributions before the above discussion).
Everything looks good, let's merge! Thanks for your contribution!
Checklist for adding MMTEB dataset
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 115 distinct languages
- The dataset runs with the mteb package.
- The following models were run on the task using the mteb run -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- If the dataset is too big, self.stratified_subsampling() under dataset_transform() can be used to subsample it.
- Tests were run locally using make test.
- The formatter was run using make lint.
- Points were added to the points folder using the PR number as the filename (e.g. 438.jsonl).
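For anyone reproducing the checklist runs, here is a minimal sketch of how the task could be evaluated with the mteb Python API. It assumes the task is registered under the name "BelebeleRetrieval" and uses an illustrative output folder; the model name is one of the two listed in the checklist above.

```python
# Minimal sketch: evaluate one of the checklist models on the new task.
# Assumes the task is registered as "BelebeleRetrieval" in mteb.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["BelebeleRetrieval"])
evaluation.run(model, output_folder="results")
```

The equivalent CLI call would follow the command template from the checklist, e.g. mteb run -m intfloat/multilingual-e5-small -t BelebeleRetrieval.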