Add support for all datasets of the GLUE benchmark #1710
Comments
Thanks @VirgileHlav for creating this issue! Just a general comment regarding dataset implementation: let's avoid the usage of lambda functions; please refer to issue #1716 for details.
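(Illustrative only: a minimal sketch of the kind of change discussed in #1716, with a hypothetical named function replacing a lambda in a DataPipe `.map()` call. Lambdas are generally avoided here because they are not picklable, which matters when the pipeline is serialized for multi-process data loading.)

```python
from torchdata.datapipes.iter import IterableWrapper

# Prefer a named, picklable function over an inline lambda (see #1716).
def _extract_label_and_text(row):
    return int(row[0]), row[1]

# Tiny in-memory example; a real dataset would build this pipe from parsed TSV rows.
dp = IterableWrapper([["1", "a sentence"], ["0", "another sentence"]])
dp = dp.map(_extract_label_and_text)  # instead of: dp.map(lambda row: (int(row[0]), row[1]))
```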
Thanks @parmeet!
Acknowledged, I will make updates to the PRs.
Thanks for creating this issue to keep track of the new datasets we are adding! One suggestion from my end would be to update our documentation for each new dataset we add here.
Fixed the lint issues and added documentation and shuffle/sharding support on all PRs.
Closing this issue, since all the datasets have been added :)
🚀 Feature
Add support for the 8 remaining datasets of the GLUE benchmark (SST-2 is already supported): CoLA, MRPC, QQP, STS-B, MNLI, QNLI, RTE, WNLI.
Motivation
In itself, adding support for all GLUE datasets has a lot of value: GLUE is one of the most widely used benchmarks in the NLP community. Furthermore, our planned effort to develop a cohesive API for Multi-Task Learning requires us to enhance our suite of datasets, starting with GLUE.
Additional context
We have already created a streamlined dataset API, with consistent use of DataPipes for dataset download and load operations, as well as a testing methodology relying on mock data (see #1493). This feature only requires adding support for these 8 datasets following that methodology.
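For illustration, a minimal sketch of what a dataset built this way could look like, assuming the split has already been downloaded as a local TSV file (the real implementations also build the download/extract portion of the pipeline with DataPipes). File paths, split names, and column indices below are hypothetical, not torchtext's actual values:

```python
import os

from torchdata.datapipes.iter import FileOpener, IterableWrapper


def _project_columns(row):
    # Named function rather than a lambda (see #1716); column indices are assumed.
    return int(row[1]), row[3]


def _hypothetical_glue_dataset(root=".data", split="train"):
    path = os.path.join(root, f"{split}.tsv")        # assumed local file layout
    dp = IterableWrapper([path])                     # DataPipe over file paths
    dp = FileOpener(dp, mode="r")                    # yields (path, stream) pairs
    dp = dp.parse_csv(skip_lines=1, delimiter="\t")  # TSV rows -> lists of fields
    dp = dp.map(_project_columns)                    # keep (label, sentence) pairs
    return dp.shuffle().sharding_filter()            # shuffle/sharding, as in the PRs
```

Testing would then place a small mocked TSV file on disk and compare the pipeline's output against it, following the mock-data approach from #1493.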
Datasets