nytud/HuWNLI

HuWNLI

Anaphora resolution datasets for Hungarian formulated as an inference task

This repository hosts Hungarian datasets for anaphora resolution, formulated as a sentence-pair classification task of natural language inference (NLI). The first version contains the Hungarian translation of the Winograd schemata, formatted as an inference task. A Winograd schema is a pair of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways in the two sentences; resolving it requires world knowledge and reasoning (Levesque et al. 2012). This dataset is also part of HuLU, the Hungarian Language Understanding Evaluation Benchmark Kit.

The HuWS corpus was created by translating and manually curating the original English Winograd schemata. The NLI format published here was created by replacing the ambiguous pronoun with each possible referent in the schemata (the method is described in the GLUE paper, Wang et al. 2019). We extended the set of sentence pairs derived from the schemata with translations of the sentence pairs that, together with the Winograd schema sentences, make up GLUE's WNLI dataset.
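The referent-substitution step described above can be sketched roughly as follows. This is only an illustration, not the authors' conversion tooling: the helper `make_nli_pairs` and its `{referent}` template are hypothetical, and the second generated pair ("A férfi nehéz volt.") is constructed here for illustration and may not appear in the dataset.

```python
from typing import Dict, List


def make_nli_pairs(
    premise: str,
    hypothesis_template: str,
    candidates: List[str],
    correct: str,
) -> List[Dict[str, str]]:
    """Build one premise/hypothesis pair per candidate referent.

    Hypothetical helper: the template marks the position of the
    ambiguous (possibly dropped) pronoun with "{referent}".
    """
    pairs = []
    for referent in candidates:
        hypothesis = hypothesis_template.format(referent=referent)
        # Label "1" only when the substituted referent is the correct one.
        label = "1" if referent == correct else "0"
        pairs.append(
            {"sentence1": premise, "sentence2": hypothesis, "Label": label}
        )
    return pairs


# Premise taken from the example instance in this README.
pairs = make_nli_pairs(
    premise="A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.",
    hypothesis_template="{referent} nehéz volt.",
    candidates=["A férfi", "A fia"],
    correct="A fia",
)

for pair in pairs:
    print(pair["Label"], pair["sentence2"])
```

Each schema sentence thus yields as many sentence pairs as it has candidate referents, exactly one of which is labelled as entailed.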

Dataset Structure

Data Instances

For each instance, there is an orig_id, an id, two sentences and a label.

An example:

{"orig_id": "4",
 "id": "4",
 "sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.",
 "sentence2": "A fia nehéz volt.",
 "Label": "1"
}

Data Fields

  • orig_id: the original id of this sentence pair (more precisely, its English counterpart's) in GLUE's WNLI dataset;

  • id: unique id of the instances;

  • sentence1: the premise;

  • sentence2: the hypothesis;

  • Label: "1" if sentence2 is entailed by sentence1, and "0" otherwise.
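As a minimal sketch of how the fields above might be read and checked, the snippet below parses the example instance from a JSON string. The helper `parse_instance` and the `labelled` flag are assumptions made for illustration; the on-disk file format and file names of the repository are not specified here, so the record is given inline.

```python
import json

# Field names as listed in "Data Fields" (note the capitalised "Label").
TEXT_FIELDS = ("orig_id", "id", "sentence1", "sentence2")


def parse_instance(raw: str, labelled: bool = True) -> dict:
    """Parse one instance and verify that the documented fields are present.

    Hypothetical helper: set labelled=False for instances without labels,
    such as those of the test set.
    """
    record = json.loads(raw)
    missing = [field for field in TEXT_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if labelled:
        if record.get("Label") not in ("0", "1"):
            raise ValueError(f"unexpected label: {record.get('Label')!r}")
        # Convenience flag: True if sentence2 is entailed by sentence1.
        record["entailed"] = record["Label"] == "1"
    return record


example = """{"orig_id": "4",
 "id": "4",
 "sentence1": "A férfi nem tudta felemelni a fiát, mert olyan nehéz volt.",
 "sentence2": "A fia nehéz volt.",
 "Label": "1"
}"""

instance = parse_instance(example)
print(instance["sentence2"], instance["entailed"])
```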

The data is distributed in three splits: a training set (562 instances), a development set (59) and a test set (134). The splits follow those of GLUE's WNLI but contain fewer instances, because many sentence pairs had to be discarded as untranslatable into Hungarian. The test set's sentence pairs are translated from GLUE's WNLI test set, which was distributed without labels; three annotators annotated the Hungarian sentence pairs. The HuWNLI test set is likewise distributed without labels. To evaluate your model, please contact us, or check HuLU's website for automatic evaluation (this feature is currently under construction).
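The split sizes reported above can serve as a quick sanity check after loading the data. The sketch below is illustrative only: the split keys `train`, `dev` and `test` and the helper names are assumptions, not names used by the repository.

```python
# Split sizes as reported in this README; the key names are assumed.
REPORTED_SPLIT_SIZES = {"train": 562, "dev": 59, "test": 134}


def split_shares(sizes: dict) -> dict:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def check_split_sizes(loaded: dict) -> list:
    """Return the names of splits whose length differs from the reported size."""
    return [
        name
        for name, expected in REPORTED_SPLIT_SIZES.items()
        if len(loaded.get(name, [])) != expected
    ]


total = sum(REPORTED_SPLIT_SIZES.values())  # 562 + 59 + 134 = 755
shares = split_shares(REPORTED_SPLIT_SIZES)

for name, share in shares.items():
    print(f"{name}: {REPORTED_SPLIT_SIZES[name]} instances ({share:.1%})")
```

Passing the loaded splits (as lists of instances keyed by split name) to `check_split_sizes` returns an empty list when all three sizes match the reported figures.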

Licensing Information

HuWNLI is released under the CC-BY-SA-4.0 License.

Citation Information

If you use this resource or any part of its documentation, please refer to:

Vadász, N., & Ligeti-Nagy, N. (2022). Winograd schemata and other datasets for anaphora resolution in Hungarian, Acta Linguistica Academica, 69(4), in press.

@article{vadaszligeti2022actawinograd,
      author = "Noémi Vadász and Noémi Ligeti-Nagy",
      title = "Winograd schemata and other datasets for anaphora resolution in Hungarian",
      journal = "Acta Linguistica Academica",
      year = "2022",
      publisher = "Akadémiai Kiadó",
      address = "Budapest, Hungary",
      volume = "69",
      number = "4",
      note = "In press."
}

and to:

Levesque, H., Davis, E., & Morgenstern, L. (2012). The Winograd Schema Challenge. In: Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

@inproceedings{levesque2012winograd,
  title={The Winograd Schema Challenge},
  author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora},
  booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning},
  year={2012},
  organization={Citeseer}
}
