Adding Stemming and Lemmatization #281

irmakyucel · 2021-06-09T08:45:29Z

Adding an option for stemming and/or lemmatization is important when using count, hash and tf-idf vectorizers: it shrinks the vocabulary by mapping words that share a root (stemming) or a lemma (lemmatization) to a single token. It also makes the patterns within a dataset more visible to the model.

Stemming

Stemming can be done with simple rule-based methods. Because it only aims to find the root of a word, the resulting stem does not have to be a meaningful word.
Stems/roots are produced by stripping the suffixes or prefixes attached to a word.
The hash vectorizer in sadedegel already does one such rule-based stemming.
By combining that rule with other rule-based methods, we can build a Stemmer class that can be used flexibly across the sadedegel platform.
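
To make the idea concrete, here is a minimal sketch of a rule-based suffix stripper. It is not the rule used in sadedegel's hash vectorizer; the `SuffixStemmer` class name and the short suffix list are assumptions chosen only for illustration, and a real Turkish stemmer needs far more rules.

```python
from typing import Iterable, List


class SuffixStemmer:
    """Toy rule-based stemmer: strips the longest matching suffix (hypothetical class)."""

    def __init__(self, suffixes: Iterable[str], min_stem_len: int = 2):
        # Try longer suffixes first so "lerde" wins over "de".
        self.suffixes: List[str] = sorted(suffixes, key=len, reverse=True)
        self.min_stem_len = min_stem_len

    def stem(self, word: str) -> str:
        # Input is assumed to be lowercased already.
        for suffix in self.suffixes:
            remaining = len(word) - len(suffix)
            if word.endswith(suffix) and remaining >= self.min_stem_len:
                return word[:remaining]
        return word


# Illustrative suffix list only (plural and locative endings), not a complete rule set.
TOY_SUFFIXES = ["lar", "ler", "da", "de", "ta", "te", "larda", "lerde"]

stemmer = SuffixStemmer(TOY_SUFFIXES)
tokens = ["evler", "evde", "evlerde", "kitaplar", "kitapta"]
stems = [stemmer.stem(t) for t in tokens]

print(stems)
print(len(set(tokens)), "distinct tokens ->", len(set(stems)), "distinct stems")
```

On this toy input the five surface forms collapse to two stems, which is the vocabulary reduction a vectorizer would benefit from.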

Lemmatization

Lemmatization finds the lemma (~başsözcük, i.e. the dictionary headword) of a word. This is usually done with a lookup in a database. (For instance, NLTK has a WordNet Lemmatizer that uses the WordNet database for lemma lookup.)
For this we would need to find a lemma database for Turkish (if one exists) and transform words by looking them up in it.
A lookup-based approach is usually slower at run time than rule-based stemming.
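
As a rough sketch of the lookup approach, assuming a lemma table is available: the `LookupLemmatizer` class and the three-entry table below are hypothetical stand-ins for a real Turkish lemma database.

```python
from typing import Dict


class LookupLemmatizer:
    """Toy lemmatizer backed by a word -> lemma table (hypothetical class)."""

    def __init__(self, table: Dict[str, str]):
        self.table = table

    def lemmatize(self, word: str) -> str:
        # Unknown words are returned unchanged.
        return self.table.get(word, word)


# Made-up miniature table; a real one would come from a lemma database.
TOY_LEMMAS = {
    "gittim": "gitmek",
    "gidiyorum": "gitmek",
    "kitaplar": "kitap",
}

lemmatizer = LookupLemmatizer(TOY_LEMMAS)
for word in ["gittim", "gidiyorum", "kitaplar", "telefon"]:
    print(word, "->", lemmatizer.lemmatize(word))
```

Unlike suffix stripping, a table can map "gidiyorum" and "gittim" to the same lemma even though their surface stems differ ("gid" vs. "git"), at the cost of needing the database and handling out-of-vocabulary words.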

I believe it would be best to start with stemming and then move on to lemmatization.

irmakyucel · 2021-07-06T14:20:12Z

I tested the idea of adding a Stemmer using two libraries that support stemming for Turkish: TurkishStemmer (Snowball) and SimpleLemma. I tested both on the TELCO Review and Tweet Sentiment datasets. I chose these datasets because the TELCO Review model performs poorly and the Tweet Sentiment model performs well, so comparing them shows how a stemmer behaves in both a weak and a strong model. All versions (with and without stemmers) were optimized using Optuna. Results are reported as F1 macro scores and are shown below:
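
For readers unfamiliar with the metric, F1 macro is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on made-up labels (the label values and predictions are illustrative, not taken from either dataset); it should agree with `sklearn.metrics.f1_score(..., average="macro")` on the same inputs.

```python
from typing import List, Sequence


def f1_macro(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up example labels.
y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

score = f1_macro(y_true, y_pred)
print(round(score, 4))
```

Because every class counts equally, a class the model never predicts correctly pulls the macro score down sharply, which is why it is a stricter metric than accuracy on imbalanced data.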

Results

Dataset	Baseline (no stemmer)	w/TurkishStemmer	w/SimpleLemma
TELCO Review	0.6833	0.6820	0.6755
Tweet Sentiment	0.8565	0.8486	0.8489

The results show that adding a stemmer makes little difference: the F1 macro score drops by roughly 0.001 to 0.008 in every case. Because the change is so small, I also analyzed cross-validation scores. These are reported as accuracy and are shown below:

Dataset	Baseline (no stemmer)	w/TurkishStemmer	w/SimpleLemma
TELCO Review	0.6103	0.6109	0.5975
Tweet Sentiment	0.8208	0.8106	0.8148
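
To compare the two tables at a glance, the differences from the score without a stemmer can be computed directly from the reported numbers. This is only arithmetic on the values above; the dictionary layout and variable names are my own.

```python
# (no stemmer, w/TurkishStemmer, w/SimpleLemma), copied from the tables above.
f1_scores = {
    "TELCO Review": (0.6833, 0.6820, 0.6755),
    "Tweet Sentiment": (0.8565, 0.8486, 0.8489),
}
cv_accuracy = {
    "TELCO Review": (0.6103, 0.6109, 0.5975),
    "Tweet Sentiment": (0.8208, 0.8106, 0.8148),
}


def deltas(scores):
    """Difference of each stemmer variant from the no-stemmer score, rounded to 4 places."""
    return {
        name: (round(stem - base, 4), round(lemma - base, 4))
        for name, (base, stem, lemma) in scores.items()
    }


f1_deltas = deltas(f1_scores)
acc_deltas = deltas(cv_accuracy)

for name in f1_scores:
    print(name, "F1 macro:", f1_deltas[name], "CV accuracy:", acc_deltas[name])
```

The only improvement is TurkishStemmer on TELCO Review cross-validation accuracy (+0.0006); the largest drop is SimpleLemma on the same dataset (-0.0128).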

irmakyucel added the enhancement (New feature or request) and question (Further information is requested) labels Jun 9, 2021

irmakyucel self-assigned this Jun 9, 2021

irmakyucel changed the title ~~Adding Stemming and Lemmazitation~~ Adding Stemming and Lemmatization Jun 11, 2021

askarbozcan added the cleanup-remove (Issues that WILL be removed as part of cleanup) and cleanup-stay (Issues that won't be removed as part of cleanup) labels and removed the cleanup-remove label Aug 14, 2022
