Add support for all datasets of the GLUE benchmark #1710

Closed
8 tasks done
vcm2114 opened this issue May 11, 2022 · 6 comments

Comments

@vcm2114
Contributor

vcm2114 commented May 11, 2022

🚀 Feature

Add support for all 8 remaining datasets (SST-2 is already supported) of the GLUE benchmark: CoLA, MRPC, QQP, STS-B, MNLI, QNLI, RTE, WNLI.

Motivation

Adding support for all GLUE datasets has a lot of value in itself: GLUE is one of the most widely used benchmarks in the NLP community. Furthermore, our planned effort to develop a cohesive API for Multi-Task Learning requires us to enhance our suite of datasets, starting with GLUE.

Additional context

We have already created a streamlined dataset API, with consistent use of DataPipes for dataset download and load operations, as well as a testing methodology relying on mock data (see #1493). This feature only requires adding support for these 8 datasets following that methodology.
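
To make the mock-data idea concrete, here is a minimal sketch using only the Python standard library. It is not the test harness from #1493: the file name, the two-column header, and the helpers `write_mock_tsv`, `load_dataset`, and `run_mock_test` are all hypothetical. The point is only that a test can write a tiny file with known contents and check that the loader returns exactly those contents, without downloading anything.

```python
import csv
import os
import tempfile


def write_mock_tsv(root, rows):
    """Write a small tab-separated file with a header, standing in for a real download."""
    path = os.path.join(root, "train.tsv")  # hypothetical file name
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["label", "sentence"])  # hypothetical header
        writer.writerows(rows)
    return path


def load_dataset(path):
    """Hypothetical loader: yield (label, sentence) pairs, skipping the header."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for label, sentence in reader:
            yield int(label), sentence


def run_mock_test():
    """Return True if the loader reproduces the mock samples exactly."""
    expected = [(1, "the cat sat"), (0, "sat cat the the")]
    with tempfile.TemporaryDirectory() as root:
        path = write_mock_tsv(root, [(str(label), text) for label, text in expected])
        return list(load_dataset(path)) == expected
```

A real dataset test would presumably swap `load_dataset` for the dataset function under test and point it at the temporary directory.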

Datasets

@parmeet
Contributor

parmeet commented May 12, 2022

Thanks @VirgileHlav for creating this issue! Just a general comment regarding dataset implementation: let's avoid usage of lambda functions, please refer to issue #1716 for details.
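
As an illustration of what avoiding lambdas can look like, the sketch below replaces two inline lambdas with module-level functions and `functools.partial`. The rationale is an assumption, since #1716 is not quoted in this thread: a common reason is that lambdas cannot be pickled by the standard `pickle` module, which matters when a pipeline is sent to worker processes. The row layout and the helper names `_has_n_columns`, `_select_label_and_text`, and `parse_rows` are hypothetical.

```python
from functools import partial

# Hypothetical CoLA-style rows: [source, label, original_label, sentence]
ROWS = [
    ["gj04", "1", "", "Our friends won't buy this analysis."],
    ["gj04", "0", "*", "They drank the pub."],
]


def _has_n_columns(n, row):
    """Named replacement for `lambda row: len(row) == n`."""
    return len(row) == n


def _select_label_and_text(row):
    """Named replacement for `lambda row: (int(row[1]), row[3])`."""
    return int(row[1]), row[3]


def parse_rows(rows):
    # partial() binds n=4 without needing a closure or a lambda.
    valid = filter(partial(_has_n_columns, 4), rows)
    return list(map(_select_label_and_text, valid))
```

Both the named functions and the `partial` object can be referenced by name from the module, which is what makes them friendlier to serialization than an anonymous function.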

@vcm2114
Contributor Author

vcm2114 commented May 12, 2022

Thanks @parmeet!

let's avoid usage of lambda functions

Acknowledged, I will make updates to the PRs.

@Nayef211
Contributor

Thanks for creating this issue to keep track of the new datasets we are adding! One suggestion from my end would be to update our documentation for each new dataset we add here.

@parmeet
Contributor

parmeet commented May 17, 2022

Let's also make sure we add shuffle and sharding filters: Reference Issue: #1727. Reference PR #1729
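
For readers unfamiliar with the two terms, here is a conceptual sketch in plain Python of what shuffling followed by sharding achieves. It is not the code from #1729 and not torchdata's implementation (where the corresponding DataPipe operations are presumably `shuffle()` and `sharding_filter()`); the helper names and the round-robin rule are assumptions for illustration. Every worker shuffles with the same seed, then keeps only its own slice, so the workers see disjoint samples that together cover the dataset.

```python
import random


def shuffle(items, seed):
    """Return a shuffled copy; all workers must share the seed to agree on the order."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


def shard(items, num_workers, worker_id):
    """Keep every num_workers-th item, starting at worker_id (round-robin sharding)."""
    for index, item in enumerate(items):
        if index % num_workers == worker_id:
            yield item


def worker_view(items, seed, num_workers, worker_id):
    """Samples seen by one worker: shuffle first, then shard."""
    return list(shard(shuffle(items, seed), num_workers, worker_id))
```

Without the sharding step, each worker would iterate over the full dataset and every sample would be emitted once per worker.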

@vcm2114
Contributor Author

vcm2114 commented May 18, 2022

Solved lint issues and added documentation + shuffle/sharding on all PRs.

@parmeet
Contributor

parmeet commented Jun 13, 2022

closing issue, since all the datasets have been added :)

parmeet closed this as completed Jun 13, 2022