
Strip Accents option for Tokenizers #270

Open
ertugrul-dmr opened this issue May 20, 2021 · 5 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@ertugrul-dmr
Contributor

While doing error analysis, I noticed that some texts are written using stripped versions of Turkish characters (çok > cok, değil > degil, ağaç > agac, etc.) while others are not. This leads to several different tokens for the same word with some vectorizers.

I believe this is worth testing to see whether it works.

I'll be working on this, and if I get satisfactory test results I'm going to open a pull request for it.

For this purpose:

  • Create a function that strips accents,
  • Test that function on some generated sentences containing the relevant accented characters,
  • Implement it in the codebase if the results look promising (note: a previously spotted bug might need to be fixed first),
  • Then test it on several prebuilt models and analyze the metrics,
  • If it passes all of the above, open a pull request for it.
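As a rough sketch of the first step (the function name is illustrative, not sadedegel's actual API), one could combine Unicode NFKD decomposition with an explicit mapping for the Turkish dotless ı, which carries no combining mark:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Replace accented Turkish characters with their ASCII base forms."""
    # NFKD splits characters like ç/ğ/ş/ö/ü into base letter + combining mark.
    # Dotless ı (and dotted İ) has no combining mark, so map it explicitly first.
    text = text.replace("ı", "i").replace("İ", "I")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("çok değil ağaç"))  # cok degil agac
```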
@husnusensoy
Contributor

Please do check the way that strip_accents works in sklearn; maybe we can have the same capability. But first, do prove that it really improves some model.

@ertugrul-dmr
Contributor Author

ertugrul-dmr commented May 27, 2021

Results:

I have implemented a similar function to preprocess texts and tested it on the prebuilt models. On average it decreased our results:

| Prebuilt Model | Original Result | Preprocessed Result |
|---|---|---|
| Tweet Sentiment Classification | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8613 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8637 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.7816 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.694, Accuracy: 0.699 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.8132 |

Conclusion:

My observations: this function makes little to no improvement on models built with HashVectorizer. Meanwhile it deteriorates tf-idf models, which I believe is because it increases the number of OOV tokens a lot.

I suggest we add it as an option, since it might yield some improvement in future uses depending on the raw text and the vectorizer used.
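One way such an opt-in could look (a hypothetical sketch, not sadedegel's actual interface), mirroring the existing emoji/hashtag/mention preprocessing flags and defaulting to off since the tf-idf results degraded:

```python
import unicodedata

def _strip_accents(text: str) -> str:
    # NFKD folding plus an explicit mapping for the Turkish dotless ı,
    # which NFKD alone does not convert.
    text = text.replace("ı", "i").replace("İ", "I")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(text: str, strip_accents: bool = False) -> str:
    """Optional accent stripping; off by default because tf-idf models degraded."""
    if strip_accents:
        text = _strip_accents(text)
    return text

print(preprocess("değil"))                      # değil
print(preprocess("değil", strip_accents=True))  # degil
```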

@ertugrul-dmr ertugrul-dmr added the question Further information is requested label May 27, 2021
@husnusensoy
Contributor

Now you are talking ...

@husnusensoy
Contributor

How did you perform these tests? Sadedegel does not support accent stripping for now, does it?

@ertugrul-dmr
Contributor Author

I added the strip_accents function locally and implemented it similarly to the emoji, hashtag, and mention preprocessing. Then I tested the models with strip_accents=True.
