
Strip Accents option for Tokenizers #270

Open
ertugrul-dmr opened this issue May 20, 2021 · 5 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@ertugrul-dmr
Contributor

While doing error analysis, I noticed that some texts are written using stripped versions of Turkish characters (çok > cok, değil > degil, ağaç > agac, etc.) while others are not. This leads to several different tokens for the same word with some vectorizers.

I believe this is worth testing to see whether it works.

I'll be working on this, and if I get satisfactory test results I'm going to open a pull request for it.

For this purpose:

  • Create a function that strips accents,
  • Test that function on some generated sentences containing the relevant accented characters,
  • Implement it in the codebase if the results look promising (note: a previously spotted bug might need to be fixed first),
  • Then test it on several prebuilt models and analyze the metrics,
  • If it passes all of the above, open a pull request for it.
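As a rough sketch of the first step (the function name is illustrative, not sadedegel's actual API), one could combine Unicode NFKD decomposition with an explicit mapping for the Turkish dotless ı, which carries no combining mark:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Replace accented Turkish characters with their ASCII base forms."""
    # NFKD splits characters like ç/ğ/ş/ö/ü into base letter + combining mark.
    # Dotless ı (and dotted İ) has no combining mark, so map it explicitly first.
    text = text.replace("ı", "i").replace("İ", "I")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("çok değil ağaç"))  # cok degil agac
```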
@husnusensoy
Contributor

Please do check the way that strip_accents works in sklearn; maybe we can have the same capability. But first, do prove that it really improves some model.

@ertugrul-dmr
Contributor Author

ertugrul-dmr commented May 27, 2021

Results:

I have implemented a similar function to preprocess texts and tested it on the prebuilt models. On average it decreased our results:

| Prebuilt Model | Original Result | Preprocessed Result |
|---|---|---|
| Tweet Sentiment Classification | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8613 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8637 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.7816 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.694, Accuracy: 0.699 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.8132 |

Conclusion:

My observations: this function makes little to no improvement on models built with HashVectorizer. Meanwhile it deteriorates tf-idf models, which I believe is because it increases the number of OOV tokens a lot.

I suggest we add it as an option, since it might yield some improvement in future uses depending on the raw text and the vectorizer used.
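One way such an opt-in could look (a hypothetical sketch, not sadedegel's actual interface), mirroring the existing emoji/hashtag/mention preprocessing flags and defaulting to off since the tf-idf results degraded:

```python
import unicodedata

def _strip_accents(text: str) -> str:
    # NFKD folding plus an explicit mapping for the Turkish dotless ı,
    # which NFKD alone does not convert.
    text = text.replace("ı", "i").replace("İ", "I")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(text: str, strip_accents: bool = False) -> str:
    """Optional accent stripping; off by default because tf-idf models degraded."""
    if strip_accents:
        text = _strip_accents(text)
    return text

print(preprocess("değil"))                      # değil
print(preprocess("değil", strip_accents=True))  # degil
```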

@ertugrul-dmr ertugrul-dmr added the question Further information is requested label May 27, 2021
@husnusensoy
Contributor

Now you are talking ...

@husnusensoy
Contributor

How did you perform these tests? Sadedegel does not support accent stripping for now, does it?

@ertugrul-dmr
Contributor Author

I added the strip_accents function locally and implemented it similarly to the emoji, hashtag, and mention preprocessing. Then I tested the models with strip_accents=True.
