- Authors: Marin Krešo, Matteo Miloš, Josip Renić
- Published at: https://www.fer.unizg.hr/_download/repository/TAR-2018-ProjectReports.pdf
Recent reports indicate dramatic growth in the volume of SMS spam messages. SMS spam classification is a challenging problem because such messages are rife with idioms and abbreviations. The most common baseline solution is the Multinomial Naive Bayes algorithm with bag-of-words term frequencies as features. As an alternative, we propose a pipeline approach that applies natural language processing (NLP) techniques, extracts new features, and performs hyperparameter optimization for the most popular machine learning classification algorithms. Our results on the SMS Spam Collection dataset show that, by incorporating the proposed pipeline, an SMS spam classification system can achieve a statistically significant performance gain over the baseline.
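
To make the baseline concrete, the following is a minimal sketch of Multinomial Naive Bayes over bag-of-words term frequencies, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the Laplace smoothing constant `alpha=1.0`, the class name `MultinomialNB`, and the six toy messages are all hypothetical and are not taken from the SMS Spam Collection dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. A real system would
    # also need to handle punctuation, idioms, and abbreviations.
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            # Bag-of-words: accumulate term frequencies per class.
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def log_posteriors(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_label in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # skip tokens never seen in training
                count = self.term_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_terms + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_posteriors(text)
        return max(scores, key=scores.get)


# Toy, made-up training messages used only for illustration.
train_texts = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim free reward",
    "are you coming home for dinner",
    "see you at the meeting tomorrow",
    "ok call me when you are home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = MultinomialNB().fit(train_texts, train_labels)
```

Calling `model.predict("claim your free prize")` classifies the message by comparing the summed log-probabilities of its tokens under each class. The proposed pipeline would replace the simple tokenizer and raw term counts with richer NLP preprocessing and additional features, and would tune hyperparameters such as `alpha` rather than fixing them.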