Add support for all datasets of the GLUE benchmark #1710

vcm2114 · 2022-05-11T16:04:01Z

🚀 Feature

Add support for all 8 remaining datasets (SST-2 is already supported) of the GLUE benchmark: CoLA, MRPC, QQP, STS-B, MNLI, QNLI, RTE, WNLI.

Motivation

Adding support for all GLUE datasets has a lot of value in itself: GLUE is one of the most widely used benchmarks in the NLP community. Furthermore, our planned effort to develop a cohesive API for Multi-Task Learning requires us to enhance our suite of datasets, starting with GLUE.

Additional context

We have already created a streamlined dataset API, with consistent use of DataPipes for dataset download and load operations, as well as a testing methodology relying on mock data (see #1493). This feature only requires adding support for these 8 datasets following that methodology.

Datasets

parmeet · 2022-05-12T03:03:38Z

Thanks @VirgileHlav for creating this issue! Just a general comment regarding dataset implementation: let's avoid usage of lambda functions; please refer to issue #1716 for details.
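For readers following along, here is a minimal sketch of what replacing a lambda can look like. The thread does not state the reason for the request (that is in #1716); the sketch assumes the usual motivation, namely that lambdas cannot be pickled by the standard `pickle` module, which matters when a pipeline is sent to worker processes. The row layout and the names `_select_columns`, `keep_text_and_label`, and `is_picklable` are hypothetical and not taken from the actual PRs.

```python
import pickle
from functools import partial


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical row layout for illustration: (index, sentence, label).
def _select_columns(row, columns):
    """Module-level replacement for `lambda row: (row[1], row[2])`."""
    return tuple(row[i] for i in columns)


# functools.partial binds the extra argument without needing a lambda.
keep_text_and_label = partial(_select_columns, columns=(1, 2))

if __name__ == "__main__":
    row = ("0", "a stirring film", "1")
    print(keep_text_and_label(row))
    # Compare how pickle treats the lambda and the named function.
    print(is_picklable(lambda r: (r[1], r[2])))
    print(is_picklable(keep_text_and_label))
```

The named function plus `partial` does the same work as the lambda but can be looked up by name, which is what `pickle` relies on for functions.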

vcm2114 · 2022-05-12T14:51:49Z

Thanks @parmeet!

> let's avoid usage of lambda functions

Acknowledged, I will make updates to the PRs.

Nayef211 · 2022-05-17T03:57:17Z

Thanks for creating this issue to keep track of the new datasets we are adding! One suggestion from my end would be to update our documentation for each new dataset we add here.

parmeet · 2022-05-17T14:41:49Z

Let's also make sure we add shuffle and sharding filters. Reference issue: #1727. Reference PR: #1729.
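As an illustration of the idea behind shuffling and sharding, here is a plain-Python sketch. It does not use the DataPipes API and is not the code from #1729; the function name `shuffle_then_shard` and its parameters are hypothetical. It assumes every worker shuffles with the same seed and then keeps only its own slice, so that the workers together see each sample exactly once.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_then_shard(
    items: Iterable[T], num_workers: int, worker_id: int, seed: int
) -> Iterator[T]:
    """Shuffle with a shared seed, then keep every num_workers-th item."""
    pool: List[T] = list(items)
    # Same seed on every worker gives the same order on every worker.
    random.Random(seed).shuffle(pool)
    for position, item in enumerate(pool):
        # Worker k keeps positions k, k + num_workers, k + 2 * num_workers, ...
        if position % num_workers == worker_id:
            yield item


if __name__ == "__main__":
    for worker_id in range(3):
        print(worker_id, list(shuffle_then_shard(range(10), 3, worker_id, seed=0)))
```

Shuffling before sharding means each worker draws from the whole dataset rather than from a fixed contiguous block, while the shared seed keeps the shards disjoint.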

vcm2114 · 2022-05-18T19:53:44Z

Solved lint issues and added documentation + shuffle/sharding on all PRs.

parmeet · 2022-06-13T20:15:24Z

closing issue, since all the datasets have been added :)

This was referenced May 12, 2022

Add support for RTE dataset with unit tests #1721

Merged

Add support for WNLI dataset with unit tests #1724

Merged

parmeet closed this as completed Jun 13, 2022
