This repository harbors the scripts for handling the data that is part of the CLIN28 shared task on spelling correction.
Automatic spell checking and correction has been subject of research for decades. Although state of the art spell checkers perform reasonably well for everyday-life applications, reaching high accuracy remains to be a challenging task. This shared task focuses on the detection and correction of spelling errors in Dutch Wikipedia texts. Wikipedia articles aim to be standard-Dutch texts, which may contain jargon. In particular, this task addresses the detection and correction of the types of spelling errors listed below:
We initially deliver one annotated document for validation purposes. A validation set consisting of 50 Wikipedia articles will follow before the end of October. The documents may contain zero, one or more spelling errors. The validation set contains all of the spelling error categories listed below. In December, a full test set will be published in the same format.
We deliver the trial set, the test set, and eventually the gold-standard reference in two formats: FoLiA XML and a JSON format. This JSON representation is automatically derived from the FoLiA documents and acts as a simplified format for this task to make it more accessible and not place an unnecessarily high burden on document parsing. It can act as input to your system as it contains all vital information, however, it is not as rich as the original FoLiA document.
Likewise, your system may output either FoLiA XML or our JSON format. In either case, it is important to ensure your
output is valid by using the validator tools we provide. For FoLiA use the foliavalidator
tool (part of
https://github.com/proycon/folia), for JSON, use the validator provided in this repository.
All data will be delivered to you in tokenised form. Tokenisation has been conducted using ucto. You're expected to adhere to this tokenisation, the data formats have special facilities for merges, splits, insertions and deletions of tokens, as may naturally arise in spelling correction.
Familiarity with JSON is assumed; we will merely state the specifics of our representation. At the root level, we have
words
and corrections
. Words contains a list of all words/tokens along with their ID and some other information.
Corrections contains a list of corrections on those words, this will be provided for the trial data and for the
gold-standard release after the task's end. For the test data, it will be an empty list which your system is expected to
fill. Consider the following example:
{
"words": [
{ "text": "Dit", "id": "word.1", "space": true, "in": "sentence.1" },
{ "text": "is", "id": "word.2", "space": true, "in": "sentence.1" },
{ "text": "een", "id": "word.3", "space": true, "in": "sentence.1" },
{ "text": "vooorbeeld", "id": "word.4", "space": false, "in": "sentence.1" },
{ "text": ".", "id": "word.5", "space": true, "in": "sentence.1" }
],
"corrections": [
{ "class": "nonworderror", "span": ["word.4"], "text": "voorbeeld" }
]
}
This example shows one correction.
Word Specification:
text
- The text of the word/token, a stringid
- The ID of the word/token (string). This is used to refer back to the token. Note that although the ID often has implicit numbering indicating ordering, this is NOT guaranteed. The order of the words should be derived from the order they appear in thewords
list only. IDs are case sensitive!space
- A boolean indicating whether the word/token is followed by a space. This can be used to reconstructed the text prior to tokenisation.in
- This refers to the ID of the structural element in which the word occurs, almost always a sentence. Sentence breaks can be detected by changes in this value. For more structural information, you'll need the original FoLiA documents.
Correction specification:
class
- The type of the error; should be one of the classes defined in our set definition (use the IDs, not the labels!). These are case sensitive.span
- A list of word IDs to which this correction applies.text
- The text of the correction, i.e. the new word(s). This text may be an empty string in case of a deletion (e.g. redundant word/punctuation), or may consist of multiple space separated words in case of a run-on error (for example naarhuis -> naar huis).after
- Should be used instead ofspan
in cases of an insertion (insertion of a new word/token where previously none existed). The value is a string and is the ID of the word after which the correction is to be inserted.
Note that all JSON for this task should be UTF-8 encoded.
The JSON option is the simpler and sufficient option for this task. But if you want to leverage the full information available in the input document, you can fall back to use the original FoLiA input.
The FoLiA format is extensively documented; consult the FoLiA website, we particularly refer to section 2.10.8 on corrections. Python users may benefit from using our Python FoLiA library, part of pynlpl and documented here.
The FoLiA documents may also act as a source for further linguistic enrichment using FoLiA-aware tools such as frog.
Detection and correction of spelling errors in the (to be released) test documents are evaluated separately, in terms of precision, recall and F-score. The script for automatic evaluation of the submissions will be published as soon as possible in this repository.
- 31 October 2017: validation data set and Valkuil demonstration online
- 1 December: test data online
- 22 December 2017: deadline for submission of source code and output
- 8 January 2018: feedback to submissions
- 26 January 2018: presenting the results at the CLIN conference
real-word confusions
, word is confused with a near neighbor (confusion with non-native spelling, homophony, grammatical errors, et cetera):- ik wordt → ik word
- stijl → steil
- hobbies → hobby’s
- me → mijn
- als → dan
split errors
, compound words which are incorrectly separated:- beleids medewerker → beleidsmedewerker
- lang durig → langdurig
runon errors
, incorrect concatenation of words:- etcetera → et cetera
- zeidat → zei dat
missing words
, sentence is ungrammatical due to missing elements:- samen met vrouw die → samen met de vrouw die
redundant words
, sentence is ungrammatical due to redundant elements:- door doordat → doordat
missing punctuation
, missing diacritical symbols and hyphenation marks (other cases of missing punctuation are excluded from the task):- een en ander → één en ander
- financiele → financiële
- autoongeluk → auto-ongeluk
redundant punctuation
, redundant diacritical symbols and hyphenation marks (other cases of redundant punctuation are excluded from the task):- financiëel → financieel
- co-assistent → coassistent
capitalisation errors
, incorrect use of capital letters:- Joodse → joodse
- Minister van Onderwijs → minister van Onderwijs
- amstelveen → Amstelveen
archaic spelling
, outdated spelling:- aktie → actie
- paardebloem → paardenbloem
non-word errors
, words that do not exist in Dutch:- voek → boek
- assrtief → assertief