Adding Stemming and Lemmatization #281

Open
irmakyucel opened this issue Jun 9, 2021 · 1 comment

Adding an option for stemming and/or lemmatization matters when using the count, hash, and tf-idf vectorizers: it shrinks the vocabulary by mapping words that share a root (stemming) or a lemma (lemmatization) to a single token. It can also make the patterns within a dataset more visible to the model.
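To make the vocabulary argument concrete, here is a minimal sketch that counts distinct tokens with and without normalization. The `build_vocabulary` helper and the `toy_roots` map are hypothetical stand-ins invented for illustration; they are not part of sadedegel or of any stemming library.

```python
def build_vocabulary(docs, normalize=None):
    """Return the set of distinct whitespace tokens, optionally normalized."""
    normalize = normalize or (lambda token: token)
    return {normalize(token) for doc in docs for token in doc.lower().split()}


# Hypothetical stand-in for a real stemmer: a hand-written token -> root map.
toy_roots = {"kitaplar": "kitap", "kitabı": "kitap", "evler": "ev", "evde": "ev"}

docs = ["kitaplar evde", "kitabı evler", "kitap ev"]

raw_vocab = build_vocabulary(docs)
stemmed_vocab = build_vocabulary(docs, lambda token: toy_roots.get(token, token))

# The normalized vocabulary is smaller because inflected forms collapse together.
print(len(raw_vocab), len(stemmed_vocab))
```

A smaller vocabulary means fewer feature columns for the count and tf-idf vectorizers, and fewer distinct keys competing for buckets in a hash vectorizer.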

Stemming

  • Can be achieved easily with rule-based methods. Because the goal is only to find the root of a word, the resulting stems don't have to be meaningful words.
  • These stems/roots are created by removing the suffixes or prefixes attached to a word.
  • One such rule-based stemming step is already used in the hash vectorizer of sadedegel.
  • So by combining other rule-based methods with the one already used in the hash vectorizer, we can build a Stemmer class that can be used flexibly across the sadedegel platform.
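The rule-based approach described above could look roughly like the following. This is only a sketch under stated assumptions: the `SuffixStemmer` class, its `min_stem_len` parameter, and the tiny suffix list are all hypothetical, and this is not the rule currently used in sadedegel's hash vectorizer.

```python
class SuffixStemmer:
    """Toy rule-based stemmer that strips the longest matching suffix."""

    def __init__(self, suffixes, min_stem_len=2):
        # Try longer suffixes first so "lerin" wins over "ler".
        self.suffixes = sorted(suffixes, key=len, reverse=True)
        self.min_stem_len = min_stem_len

    def stem(self, word):
        word = word.lower()
        for suffix in self.suffixes:
            remaining = len(word) - len(suffix)
            # Only strip if enough of the word is left to act as a stem.
            if word.endswith(suffix) and remaining >= self.min_stem_len:
                return word[:remaining]
        return word


# Hypothetical, deliberately tiny suffix list; a real stemmer needs far more rules.
stemmer = SuffixStemmer(["ler", "lar", "lerin", "ların", "de", "da"])
print([stemmer.stem(w) for w in ["evler", "evlerin", "kitaplar", "ev"]])
```

Keeping the rules as constructor arguments is one way to let different rule sets, including the existing hash vectorizer rule, share a single interface.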

Lemmatization

  • Lemmatization finds the lemma (~başsözcük) of a word. This is usually done with a lookup in a database. (For instance, NLTK has a WordNet Lemmatizer that uses the WordNet database for lemma lookup.)
  • For this we would need to find a lemma database for Turkish (if one exists) and transform words by looking them up in that database.
  • This is usually more expensive at computation time than stemming.
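A lookup-based lemmatizer of the kind described above can be sketched as a dictionary lookup with a fallback. The `LookupLemmatizer` class and the three-entry table are hypothetical and purely illustrative; a real implementation would load a full Turkish lemma database, which this issue has not yet identified.

```python
class LookupLemmatizer:
    """Toy lemmatizer: map a surface form to its lemma via a lookup table."""

    def __init__(self, lemma_table):
        self.lemma_table = lemma_table

    def lemmatize(self, word):
        key = word.lower()
        # Fall back to the lowercased word when the table has no entry.
        return self.lemma_table.get(key, key)


# Hypothetical entries for illustration only, not taken from a real database.
lemmatizer = LookupLemmatizer(
    {"gittim": "gitmek", "gidiyorum": "gitmek", "kitaplar": "kitap"}
)
print([lemmatizer.lemmatize(w) for w in ["gittim", "gidiyorum", "kitaplar", "masa"]])
```

Unlike suffix stripping, the lookup can handle irregular forms such as "gidiyorum", but only for words that the table actually covers.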

I believe a good approach is to start with stemming and then move on to lemmatization.

irmakyucel added the enhancement and question labels on Jun 9, 2021
irmakyucel self-assigned this on Jun 9, 2021
irmakyucel changed the title from "Adding Stemming and Lemmazitation" to "Adding Stemming and Lemmatization" on Jun 11, 2021

irmakyucel commented Jul 6, 2021

I tested the idea of adding a Stemmer using two libraries that support Turkish: TurkishStemmer (Snowball) and SimpleLemma. Both were tested on the TELCO Review and Tweet Sentiment datasets for comparison. I chose these datasets because the TELCO Review model performs poorly and the Tweet Sentiment model performs well, so comparing them shows how a stemmer behaves in both a weak and a strong model. All versions (with and without stemmers) were optimized using Optuna. Results are reported as F1 macro scores and are shown below:

Results

| Dataset | Previous Score | w/TurkishStemmer | w/SimpleLemma |
|---|---|---|---|
| TELCO Review | 0.6833 | 0.6820 | 0.6755 |
| Tweet Sentiment | 0.8565 | 0.8486 | 0.8489 |
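For readers unfamiliar with the metric, the F1 macro score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The pure-Python sketch below illustrates the definition on made-up labels; it is not the evaluation code that produced the numbers above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero because
        # every label appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(round(macro_f1(y_true, y_pred), 4))
```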

The results show that adding a stemmer changes little: the score drops slightly in every case, by roughly 0.001 to 0.008 points. Because the change is so small, I also analyzed the cross-validation scores. These are reported as accuracy and are shown below:

| Dataset | Previous Score | w/TurkishStemmer | w/SimpleLemma |
|---|---|---|---|
| TELCO Review | 0.6103 | 0.6109 | 0.5975 |
| Tweet Sentiment | 0.8208 | 0.8106 | 0.8148 |
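The differences against the previous scores are easier to compare when computed explicitly. The snippet below only re-uses the numbers reported in the two tables; the `deltas` helper and the dictionary layout are illustrative choices, not part of the original experiments.

```python
# Scores copied from the two tables in this comment.
f1_macro = {
    "TELCO Review": {"previous": 0.6833, "TurkishStemmer": 0.6820, "SimpleLemma": 0.6755},
    "Tweet Sentiment": {"previous": 0.8565, "TurkishStemmer": 0.8486, "SimpleLemma": 0.8489},
}
cv_accuracy = {
    "TELCO Review": {"previous": 0.6103, "TurkishStemmer": 0.6109, "SimpleLemma": 0.5975},
    "Tweet Sentiment": {"previous": 0.8208, "TurkishStemmer": 0.8106, "SimpleLemma": 0.8148},
}


def deltas(scores):
    """Return each variant's score minus the previous score, per dataset."""
    return {
        dataset: {
            name: round(value - row["previous"], 4)
            for name, value in row.items()
            if name != "previous"
        }
        for dataset, row in scores.items()
    }


for title, table in [("F1 macro", f1_macro), ("CV accuracy", cv_accuracy)]:
    for dataset, row in deltas(table).items():
        print(title, dataset, row)
```

In these numbers, the only improvement over the previous score is TurkishStemmer on TELCO Review in cross-validation accuracy, and it is very small.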

askarbozcan added the cleanup-remove and cleanup-stay labels and removed the cleanup-remove label on Aug 14, 2022