This project aims to explore and evaluate existing spell-checking tools on various datasets.
- Birkbeck Spelling Error Corpus - This dataset, developed by the University of London, contains a collection of spelling errors commonly made by English speakers. It provides valuable examples of erroneous word forms and their correct counterparts, making it ideal for testing and training spell-checking tools.
- Holbrook Corpus - This corpus includes English sentences annotated with common spelling mistakes and their corrections. The dataset allows for a comprehensive study of spelling errors in real-world contexts, helping to refine spelling correction models by providing a variety of sentence structures.
Note: the data in `data/holbrook/data.dat` has been modified by removing the first 21 rows from the original dataset (the first 20 rows describe words which had no targets).
- Aspell Testing Corpus - Derived from the Aspell spell-checker, this dataset consists of a list of misspelled words and their correct versions. The data is particularly useful for benchmarking spell-checking tools, as it includes a broad range of typographical errors commonly encountered in everyday text.
- Wikipedia Misspellings Dataset - Compiled from Wikipedia entries, this dataset includes frequently observed misspellings from a large online corpus. It captures the kinds of spelling errors users make on public platforms, aiding in developing models that perform well in diverse and noisy text environments.
- English Sentences (with randomly introduced errors) - This Kaggle dataset contains English sentences where errors have been artificially introduced at random (the code used for introducing errors is available at `data/sentences/gen_errors.py`; a simplified sketch of the idea follows this list). It provides a controlled environment to assess spell-checking performance across varying error types and sentence contexts, simulating real-world typos and mistakes.
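The actual generator lives in `data/sentences/gen_errors.py`; the snippet below is only a hedged sketch of the general idea (random character-level edits applied to a fraction of the words) and is not taken from that file.

```python
import random
import string

def corrupt_word(word: str) -> str:
    """Apply one random character-level edit: delete, insert, substitute, or swap."""
    if len(word) < 3:
        return word
    op = random.choice(["delete", "insert", "substitute", "swap"])
    i = random.randrange(len(word) - 1)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    # swap two adjacent characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt_sentence(sentence: str, error_rate: float = 0.2) -> str:
    """Corrupt roughly `error_rate` of the words in a sentence."""
    return " ".join(
        corrupt_word(w) if random.random() < error_rate else w
        for w in sentence.split()
    )

print(corrupt_sentence("the quick brown fox jumps over the lazy dog", error_rate=0.3))
```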
- pyspellchecker - a Python-based spell-checking tool that uses Levenshtein distance to identify and suggest corrections for misspelled words.
- Hunspell - an open-source spell checker widely used in applications like LibreOffice, OpenOffice, Firefox, and Chrome. It supports complex languages and morphological structures, handling compound words and allowing custom dictionaries. For Python integration, I've used the `pyhunspell` library.
- Vennify's T5 Grammar Correction transformer - a T5-based transformer model trained specifically for grammar correction, available on Hugging Face. This model leverages deep learning to handle a wide range of language errors, making it suitable for complex correction tasks that go beyond basic spelling.
- SymSpell - known for its speed, SymSpell uses a dictionary-based approach to provide fast, memory-efficient spell correction. It’s ideal for large datasets or applications where rapid error detection and correction are critical.
- TextBlob - a Python library for processing textual data, providing spell-checking as part of its toolkit. A short usage sketch covering all of these tools follows below.
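To make the comparison concrete, here is a minimal, hedged sketch of how each of these tools can be queried for a correction. The Hunspell dictionary paths, the symspellpy frequency-dictionary file name, and the Hugging Face model id are assumptions about a typical setup, not necessarily the exact configuration used in this project.

```python
# Hedged sketch: asking each tool for a correction.
# Dictionary paths and model ids below are assumptions and may need adjusting.

# pyspellchecker (Levenshtein-distance based)
from spellchecker import SpellChecker
spell = SpellChecker()
print(spell.correction("speling"))                      # e.g. "spelling"

# Hunspell via pyhunspell (the .dic/.aff paths are system dependent)
import hunspell
hun = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                        "/usr/share/hunspell/en_US.aff")
print(hun.spell("speling"), hun.suggest("speling"))

# SymSpell via symspellpy, loading its bundled frequency dictionary
# (file name may vary by version)
import importlib.resources
from symspellpy import SymSpell, Verbosity
sym = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dict_path = importlib.resources.files("symspellpy") / "frequency_dictionary_en_82_765.txt"
sym.load_dictionary(str(dict_path), term_index=0, count_index=1)
print(sym.lookup("speling", Verbosity.CLOSEST, max_edit_distance=2)[0].term)

# TextBlob
from textblob import TextBlob
print(TextBlob("I havv goood speling").correct())

# Vennify's T5 grammar-correction model (downloads weights from Hugging Face)
from transformers import pipeline
fix = pipeline("text2text-generation", model="vennify/t5-base-grammar-correction")
print(fix("grammar: I havv goood speling", max_length=64)[0]["generated_text"])
```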
- Latency - Indicates the time taken by the model to process and output corrections for each input, essential for evaluating the model's efficiency, especially in real-time or resource-constrained applications.
- Accuracy - Measures the proportion of correctly predicted corrections among all predictions, giving a general sense of the model’s performance in identifying and correcting errors.
- Precision - Focuses on the accuracy of the corrections the model suggests, defined as the proportion of correctly corrected errors out of all the corrections made by the model. This metric is important for minimizing false positives.
- Recall - Represents the model's ability to catch all actual errors by indicating the proportion of correctly corrected errors out of the total errors present. High recall is essential for ensuring that the model identifies as many errors as possible.
- $F_{1}$-score - A harmonic mean of Precision and Recall, this metric balances the trade-off between the two, providing a single score that reflects both the correctness of the corrections and the model's ability to capture errors comprehensively.
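For reference, the snippet below shows how these metrics are typically computed from confusion-matrix counts; the example counts are placeholders, not values from the experiments.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)            # correct predictions / all predictions
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # correct corrections / corrections made
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # correct corrections / errors present
    f1 = (2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only, not taken from the actual experiments
print(metrics(tp=40, fp=10, tn=35, fn=15))
```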
Overview of Results (based on `average_values.csv`):
Tool | Avg. Latency [ms] | Avg. Precision | Avg. Recall | Avg. F1-score | Avg. Accuracy |
---|---|---|---|---|---|
SymSpell | 2.66614 | 0.53245 | 0.53249 | 0.53247 | 0.54795 |
TextBlob | 62.49305 | 0.50099 | 0.50152 | 0.50125 | 0.53602 |
Hunspell | 45.16548 | 0.42285 | 0.42366 | 0.42325 | 0.47944 |
Pyspell | 112.72535 | 0.53438 | 0.53424 | 0.53431 | 0.54620 |
T5 | 1553.49168 | 0.19368 | 0.23467 | 0.20449 | 0.33033 |
- Fastest models: `symspell` has the lowest latency, with times ranging from ~0.3 seconds (`aspell.csv`, `wikipedia.csv`) to ~12 seconds (`sentences.csv`). `hunspell` also performs relatively quickly, with latencies generally around 20–25 seconds, except for `sentences.csv`, where latency spikes to ~141 seconds.
- Slowest models: `T5` is consistently the slowest, with latencies exceeding 1500 seconds across datasets. This shows that T5's deep-learning-based approach, while more powerful, requires significantly more processing time.

For applications needing real-time or faster spell-checking, `symspell` or `hunspell` would be a good choice, whereas `T5` models may be more suitable where time is not a critical factor.
- Highest accuracy models: `textblob` achieves the highest accuracy on `sentences.csv` (0.82) and is generally competitive on other datasets. `symspell` also performs well, especially on the same dataset, reaching ~0.73 accuracy.
- Lowest accuracy models: `T5` has the lowest accuracy on most datasets, peaking on `sentences.csv` (0.73) but struggling on the `aspell`, `birkbeck`, `holbrook`, and `wikipedia` datasets, likely due to more complex errors.
- Top performers: `symspell`, `textblob`, and `pyspell` perform better in precision and recall across most datasets. `symspell` shows balanced precision and recall, especially on `sentences.csv`, where it achieves over 0.8 in both metrics.
- Low-scoring models: `T5` performs poorly on datasets like `aspell.csv`, `birkbeck.csv`, and `holbrook.csv`, with very low precision and recall. However, it performs better on `sentences.csv` (precision ~0.81), possibly because this dataset has less complex corrections.
- If looking for speed and accuracy, `symspell` and `textblob` seem like the best choices.
- The `t5` model, though slower and less accurate, might be more appropriate for more subtle language tasks, given additional tuning.
- `hunspell` offers support for more complex language structures like compounds and custom dictionaries, which might make it a better choice for more sophisticated language tasks, even though its recall and F1-score aren't as high as `textblob`'s.
- `symspell`:
  - Strengths:
    - low latency,
    - good for common errors,
  - Weaknesses:
    - limited vocabulary,
    - low precision and recall on complex errors,
  - Possible improvements:
    - expand the dictionary to include more complex terms,
    - incorporate context-based corrections
- `textblob`:
  - Strengths:
    - higher level of language understanding,
    - simple to use,
  - Weaknesses:
    - higher latency than `symspell`, especially on bigger datasets,
    - worse recall,
  - Possible improvements:
    - expand the model's vocabulary with more domain-specific terms,
    - parallelize code or rewrite some of its parts in C/C++ for better performance
- `hunspell`:
  - Strengths:
    - extensive dictionary,
    - allows customization (e.g. using user-defined dictionaries),
    - not that complex to use,
  - Weaknesses:
    - high false positive rate,
    - moderate latency,
  - Possible improvements:
    - expand the dictionary with more domain-specific vocabulary,
    - incorporate contextual spell checking
- `pyspell`:
  - Strengths:
    - good performance on basic datasets,
    - simple to use,
  - Weaknesses:
    - high latency,
  - Possible improvements:
    - rewrite crucial algorithms in C/C++ for better performance,
    - use different ML techniques for correcting more sophisticated errors
- `t5`:
  - Strengths:
    - contextual understanding of text,
    - flexibility (can be fine-tuned for more domain-specific language),
  - Weaknesses:
    - very high latency,
    - low performance on specific datasets,
  - Possible improvements:
    - try distillation techniques to reduce model size while preserving most of its performance,
    - fine-tune with more data
In order to test the models, I've chosen a diverse set of well-known corpora as well as synthetic datasets created by randomly introducing errors. This variety was crucial for assessing how each model performed under different conditions.
I selected several popular spell-checking tools like `pyspellchecker`, `Hunspell`, `SymSpell`, `TextBlob`, and Vennify's T5 Grammar Correction transformer. Each of these tools has different strengths, from basic spell-checking to handling complex languages, allowing for a comprehensive comparison.
I've chosen accuracy, latency, precision, recall, and F1-score since they are commonly employed in research across various fields, particularly in machine learning, natural language processing, and spell-checking systems.
I've tracked each model's outputs on the prepared data and calculated the number of true positives, false positives, true negatives, and false negatives to compute the necessary metrics; a simplified sketch of this bookkeeping is shown below.
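The sketch below assumes the corrupted sentence, the model output, and the reference all split into the same number of tokens; the project's actual implementation may differ in details.

```python
def confusion_counts(source: str, predicted: str, target: str):
    """Word-aligned confusion-matrix counts for one sentence.

    Assumes source, predicted, and target tokenize to the same length.
    TP: a misspelled word the model corrected to the target.
    FN: a misspelled word the model failed to fix.
    TN: a correct word the model left untouched.
    FP: a correct word the model changed away from the target.
    """
    tp = fp = tn = fn = 0
    for src, pred, ref in zip(source.split(), predicted.split(), target.split()):
        if src != ref:            # the word was misspelled
            if pred == ref:
                tp += 1
            else:
                fn += 1
        else:                     # the word was already correct
            if pred == ref:
                tn += 1
            else:
                fp += 1
    return tp, fp, tn, fn

print(confusion_counts("I havv good speling",
                       "I have good spelling",
                       "I have good spelling"))
```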
- Usage of the `T5` model: sometimes the model produced unexpected and nonsensical text. This issue probably stemmed from the model's reliance on its training data, which sometimes included phrases that, while grammatically correct, didn't make sense in context. For example, instead of suggesting a straightforward correction for a misspelled word, `T5` might generate a convoluted sentence that contained unrelated terms or altered the intended meaning entirely. Because of that, I had to switch to a set-based way of calculating the confusion matrix, which is less exact but gives similar results (see the sketch after this list).
- Gathering data: due to issues with availability and quality, I couldn't find many good datasets to compare the models on.
- Resource constraints: limited access to computational resources and data preprocessing tools slowed the data-gathering and model usage process, making it difficult to use large models and datasets. Due to the lack of computational resources, I couldn't test the `T5` model on as much data.
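As mentioned above, T5's rewrites cannot always be aligned word-by-word with the reference, so the confusion matrix was computed with sets instead. The sketch below shows one way such a set-based approximation could look; it illustrates the idea and is not necessarily the exact scheme used in the project.

```python
def set_confusion_counts(source: str, predicted: str, target: str):
    """Approximate confusion counts when word alignment is impossible (e.g. T5 rewrites).

    Works on sets of unique words, so word order and duplicates are ignored,
    which is why it is less exact than the aligned version.
    """
    src, pred, ref = set(source.split()), set(predicted.split()), set(target.split())
    tp = len((ref - src) & pred)            # required corrections present in the output
    fn = len((ref - src) - pred)            # required corrections missing from the output
    fp = len(pred - ref)                    # output words that should not be there
    tn = len(src & ref & pred)              # correct words preserved
    return tp, fp, tn, fn

print(set_confusion_counts("I havv good speling",
                           "I have good spelling today",
                           "I have good spelling"))
```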
In order to run the code locally you need to have Python 3.10 installed.

- Create a virtual environment with `venv` or `conda`
- Clone the repository with `git clone https://github.com/dec0dedd/jetbrains-writing-assistance.git`
- Install all the required dependencies with `pip install -r requirements.txt`
- Download all the required datasets and put them in their respective `data.dat` files (e.g. the Holbrook corpus goes in `data/holbrook/data.dat`, while the Aspell testing corpus goes in `data/aspell/data.dat`), or use the already prepared ones
- Run `make parse_data` to generate all parsed CSV files for testing the models (or use the already available data from `data/`)
- Run `make run_all` to run all models and generate model metrics (or use the already available metric data from `metrics/`)
- Re-create all plots based on the new data with `python gen_results.py`. Metric data will be available in CSV format in `metric_data/{model_name}.csv`, e.g. `metric_data/t5.csv` or `metric_data/textblob.csv`
- New plots based on the generated results will be available in `plots/` with the name `{metric}.png`, e.g. `latency.png`
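If you want to rebuild a summary table like `average_values.csv` yourself, a sketch along the following lines could work; the column names used here ("latency", "precision", "recall", "f1", "accuracy") are assumptions and may not match the actual headers in the `metric_data/` CSVs.

```python
# Hypothetical sketch: recompute per-model averages from the generated metric CSVs.
# Column names are assumptions and may differ from the actual files in metric_data/.
from pathlib import Path

import pandas as pd

rows = []
for csv_path in Path("metric_data").glob("*.csv"):
    df = pd.read_csv(csv_path)
    means = df[["latency", "precision", "recall", "f1", "accuracy"]].mean()
    rows.append(means.rename(csv_path.stem))  # use the file name as the model name

summary = pd.DataFrame(rows)
print(summary.round(5))
```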