Strip Accents option for Tokenizers #270
Comments
Please do check the way that ...
Results: I implemented a similar function to preprocess texts and tested it on prebuilt models. On average, it decreased our results.
Conclusion: My observations are that this function yields little to no improvement over models built with HashVectorizer, while it degrades TF-IDF models, which I believe increases the number of OOV tokens considerably... I suggest we add it as an option, since it might give some improvement in future uses depending on the raw text and the vectorizer in play.
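To make the "optional" idea concrete, here is a minimal sketch of how such a flag could be forwarded, assuming scikit-learn-style vectorizers; the `make_vectorizer` helper and its parameters are illustrative, not sadedegel's actual API:

```python
# Hypothetical helper; scikit-learn vectorizers already accept a
# strip_accents argument, so an optional flag can simply forward it.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

def make_vectorizer(kind: str, accent_strip: bool = False):
    # "unicode" applies NFKD-based accent stripping; None leaves text as-is.
    strip = "unicode" if accent_strip else None
    if kind == "tfidf":
        return TfidfVectorizer(strip_accents=strip)
    return HashingVectorizer(strip_accents=strip)
```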
Now you are talking ...
How did you perform these tests? Sadedegel does not support accent stripping for now, does it?
I added the strip_accents function on my local branch, then implemented it similar to ...
While doing error analysis, I noticed that some texts are written using accent-stripped versions of Turkish characters, e.g.
çok > cok, değil > degil, ağaç > agac
while others are not. This leads to several different tokens for the same word under some vectorizers. I believe accent stripping is worth testing to see whether it helps.
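For illustration, a minimal sketch of such a strip_accents function based on Unicode decomposition; this shows the general technique, not necessarily the exact implementation tested here:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Replace accented characters with their base form,
    e.g. çok -> cok, değil -> degil, ağaç -> agac."""
    # NFKD splits e.g. 'ç' into 'c' + a combining cedilla; dropping
    # the combining marks keeps only the base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("çok değil ağaç"))  # -> cok degil agac
```

One Turkish-specific caveat: the dotless 'ı' has no Unicode decomposition, so it passes through unchanged; a full Turkish normalizer would need an explicit mapping such as 'ı' -> 'i' on top of this.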
I'll be working on this, and if I get satisfactory test results I'm going to open a pull request for it.
For this purpose: