Contextual word checker for better suggestions
Identifying whether a candidate word is a spelling error at all is a substantial task in its own right, as the following quote from a research paper explains:
Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
This package currently focuses on correcting Out of Vocabulary (OOV) words, i.e. non-word errors (NWE), using a BERT model. The idea behind using BERT is to take the surrounding context into account when correcting an OOV word. In the future, I would like to focus on RWE and on optimising the package by implementing it in Cython.
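To make the NWE/RWE distinction and the masking idea concrete, here is a minimal sketch in plain Python. It is not the package's implementation: `VOCAB`, `classify_error` and `mask_oov` are hypothetical names, the vocabulary is a toy set, and tokenisation is a plain whitespace split. It only shows how an OOV token could be replaced by a mask so that a masked language model such as BERT can propose replacements from context.

```python
# Toy vocabulary (assumption): a real checker would rely on a much larger
# word list or on the language model's own vocabulary.
VOCAB = {"income", "was", "million", "compared", "to", "the",
         "prior", "year", "of", "form", "from"}


def classify_error(typed, intended, vocab):
    """Label a typed word as 'correct', 'RWE' or 'NWE' given the intended word."""
    if typed.lower() == intended.lower():
        return "correct"
    # A misspelling that is still a valid word is a real word error.
    return "RWE" if typed.lower() in vocab else "NWE"


def mask_oov(tokens, vocab, mask="[MASK]"):
    """Return one masked copy of the sentence per out-of-vocabulary token."""
    masked = []
    for i, tok in enumerate(tokens):
        # Only alphabetic tokens are checked; numbers and symbols are skipped.
        if tok.isalpha() and tok.lower() not in vocab:
            masked.append(" ".join(tokens[:i] + [mask] + tokens[i + 1:]))
    return masked


print(classify_error("milion", "million", VOCAB))
print(classify_error("form", "from", VOCAB))
print(mask_oov("Income was milion compared to the prior year".split(), VOCAB))
```

Each masked sentence would then be passed to the language model, and the predictions for the masked position serve as correction candidates.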
The package can be installed using pip. It requires Python 3.6+.
pip install contextualSpellCheck
Also, please install the dependencies from requirements.txt
Note: For examples in other languages, check the examples
>>> import contextualSpellCheck
>>> import spacy
>>> ## We require NER to identify if it is PERSON
>>> ## also require parser because we use Token.sent for context
>>> nlp = spacy.load("en_core_web_sm")
>>> contextualSpellCheck.add_to_pipe(nlp)
<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
['tagger', 'parser', 'ner', 'contextual spellchecker']
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'
Or you can add it to the spaCy pipeline manually:
>>> import spacy
>>> import contextualSpellCheck
>>> nlp = spacy.load('en')
>>> checker = contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck()
>>> nlp.add_pipe(checker)
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
After adding the contextual spell checker to the pipeline, you use the pipeline as usual. The spell check suggestions and other data can be accessed through extensions.
>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
>>> print(doc._.performed_spellCheck)
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
>>> print(doc[4]._.get_suggestion_spellCheck)
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}
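The score lists shown above are plain lists of `(suggestion, probability)` tuples, so they are easy to post-process. The sketch below is an assumption about how a caller might do that, not part of the package: `best_candidate` and `word_candidates` are hypothetical helpers, and the scores are copied from the first `milion` entry above (truncated to five candidates). Note that the package's own suggestion is not necessarily the highest-probability candidate: for the second `milion` above, `billion` scores highest, yet the suggestion is `million`.

```python
# Scores copied from the first `milion` entry in the output above.
scores = [("million", 0.59422), ("billion", 0.24349), (",", 0.08809),
          ("trillion", 0.01835), ("##M", 0.00591)]


def best_candidate(scored, min_prob=0.5):
    """Return the top-scoring suggestion, or None if it falls below min_prob."""
    if not scored:
        return None
    word, prob = max(scored, key=lambda pair: pair[1])
    return word if prob >= min_prob else None


def word_candidates(scored):
    """Keep only purely alphabetic candidates.

    This drops punctuation such as ',' and word-piece fragments such as '##M'.
    """
    return [(word, prob) for word, prob in scored if word.isalpha()]


print(best_candidate(scores))
print(word_candidates(scores))
```

A threshold like `min_prob` is one simple way to avoid acting on low-confidence suggestions; the right value depends on the model and the data.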
To make usage simpler, spaCy provides custom extensions that a library can use. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions at the doc, span and token level. The tables below summarise the extensions.
Extension | Type | Description | Default |
doc._.contextual_spellCheck | Boolean | To check whether contextualSpellCheck is added as extension | True |
doc._.performed_spellCheck | Boolean | To check whether contextualSpellCheck identified any misspells and performed correction | False |
doc._.suggestions_spellCheck | {Spacy.Token:str} | if corrections are performed, it returns the mapping of misspell token (spaCy.Token) with suggested word (str) | {} |
doc._.outcome_spellCheck | str | corrected sentence (str) as output | "" |
doc._.score_spellCheck | {Spacy.Token:List(str,float)} | if corrections are identified, it returns the mapping of misspell token (spaCy.Token) with suggested words (str) and probability of that correction | None |
Extension | Type | Description | Default |
span._.get_has_spellCheck | Boolean | To check whether contextualSpellCheck identified any misspells and performed correction in this span | False |
span._.score_spellCheck | {Spacy.Token:List(str,float)} | if corrections are identified, it returns the mapping of misspell token (spaCy.Token) with suggested words (str) and probability of that correction for tokens in this span | {spaCy.Token: []} |
Extension | Type | Description | Default |
token._.get_require_spellCheck | Boolean | To check whether contextualSpellCheck identified any misspells and performed correction on this token | False |
token._.get_suggestion_spellCheck | str | if corrections are performed, it returns the suggested word (str) | "" |
token._.score_spellCheck | [(str,float)] | if corrections are identified, it returns suggested words (str) and probability (float) of that correction | [] |
At present, there is a simple GET API to get you started. You can run the app locally and experiment with it.
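Because the API is a GET endpoint, the input text travels in the URL and has to be percent-encoded. A browser does this automatically, but a script must do it itself. The sketch below shows one way to build such a URL with the standard library. The host, port and parameter name (`BASE_URL`, `PARAM`) are assumptions made for illustration; substitute whatever your local app actually exposes.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:5000/"  # hypothetical address of the local app
PARAM = "query"                      # hypothetical query parameter name


def build_url(text, base_url=BASE_URL, param=PARAM):
    """Percent-encode the text and attach it as a query parameter."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{base_url}?{urlencode({param: text}, quote_via=quote)}"


url = build_url("Income was $9.4 milion")
print(url)

# Decoding the query string recovers the original text.
print(parse_qs(urlsplit(url).query)[PARAM][0])
```

The resulting URL can be passed to any HTTP client; the decoded round trip at the end is just a sanity check that nothing is lost in encoding.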
Query: You can use the endpoint
Note: Your browser can handle the text encoding
{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "suggestion_score": {
        "milion": [ ... ],
        "milion:1": [ ... ]
    }
}
- dependency version in (#38)
- use cython for part of the code to improve performance (#39)
- Improve metric for candidate selection (#40)
- Add examples for other languages (#41)
- Update the logic of misspell identification (OOV) (#44)
- better candidate generation (solved by #44?)
- add metric by testing on datasets
- Improve documentation
- Improve logging in code
- Add support for Real Word Error (RWE) (Big Task)
- add multi mask out capability
Completed Task
- specify maximum edit distance for
- allow user to specify bert model
- Include transformers deTokenizer to get better suggestions
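The "maximum edit distance" item in the list above can be pictured as a filter over candidate suggestions: candidates that are too many edits away from the misspelled word are discarded. The sketch below uses a standard Levenshtein distance; `edit_distance` and `filter_by_distance` are hypothetical helpers written for illustration, and the package's own ranking logic may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def filter_by_distance(misspelled, candidates, max_distance):
    """Keep candidates within max_distance edits of the misspelled word."""
    return [c for c in candidates
            if edit_distance(misspelled, c) <= max_distance]


candidates = ["million", "billion", "trillion", "annually"]
print(filter_by_distance("milion", candidates, max_distance=2))
```

Combining such a filter with the model's probabilities is one plausible way to prefer `million` over `billion` for `milion`, since `million` is a single edit away.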
If you like the project, please ⭑ it and show your support! Also, if you feel the current behaviour is not as expected, please feel free to raise an issue. If you can help with any of the above tasks, please open a PR with the necessary changes to documentation and tests.
Below are some of the projects/work I referred to while developing this package
- Explosion AI. Architecture. May 2020. url:
- Monojit Choudhury et al. “How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach”. In: arXiv preprint physics/0703198 (2007).
- Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv:1810.04805 [cs.CL].
- Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. url:
- Ines. Chapter 3: Processing Pipelines. May 2020. url:
- Eric Mays, Fred J Damerau, and Robert L Mercer. “Context based spelling correction”. In: Information Processing & Management 27.5 (1991), pp. 517–522.
- Peter Norvig. How to Write a Spelling Corrector. May 2020. url:
- Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv:1910.14080 [cs.CL].
- Thomas Wolf et al. “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. url: