This project was commissioned by a client on 2024-06-15. If you're interested in similar work, check out my freelance data analyst profile on Fastwork.
This project performs sentiment analysis with a linear kernel SVM on an Instagram comment dataset provided by the client. A Streamlit app was also developed so that users can interact with the deployed model.
The first step was text preprocessing, which ranged from removing unnecessary characters to lemmatization. A significant challenge was that the comments are written in informal Indonesian, while most text processing libraries, such as spaCy, are optimized for English. To address this, I translated the comments into formal English with Meta's 70B-parameter Llama 3 model, which proved more effective than tools like Google Translate. Below is a sample of three rows from the dataset after preprocessing:
username | sentimen | comment | translated_comment | case_folding | cleaning | lemmatization | remove_stopwords |
---|---|---|---|---|---|---|---|
pkk_desakisik | positif | terima kasih Bu Yani beserta Rombongan sudah datang di Desa Kisik | We would like to thank Mrs. Yani and her entourage for visiting Kisik Village. | we would like to thank mrs. yani and her entourage for visiting kisik village. | we would like to thank mrs yani and her entourage for visiting kisik village | like thank mrs yani entourage visit kisik village | like thank mrs yani entourage visit kisik village |
abde_prastio | positif | alhamdulillah makin keren kabupatenku sekarang 😍 | Praise be to God, my regency is amazing now. | praise be to god, my regency is amazing now. | praise be to god my regency is amazing now | praise god regency amazing now | praise god regency amazing |
maarif1515 | positif | Jalan poros kabupaten yang menghubungkan dari desa dampaan sampai dungus mohon untuk di tinjau | The highway connecting from Dampaan Village to Dungus, which passes through the district's axis, is requested to be reviewed. | the highway connecting from dampaan village to dungus, which passes through the district's axis, is requested to be reviewed. | the highway connecting from dampaan village to dungus which passes through the district s axis is requested to be reviewed | highway connect dampaan village dungus pass district axis request review | highway connect dampaan village dungus pass district axis request review |
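
The `case_folding`, `cleaning`, and `remove_stopwords` columns above can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the project's actual code: the function names and the tiny `STOPWORDS` set are assumptions made for illustration, and lemmatization is left out because it needs an NLP library such as spaCy.

```python
import re

# Illustrative stop-word list only; the list used in the project is not shown.
STOPWORDS = {"we", "would", "to", "and", "her", "for", "be", "my", "is", "now"}


def case_fold(text: str) -> str:
    """Lowercase the translated comment."""
    return text.lower()


def clean(text: str) -> str:
    """Replace anything that is not a letter, digit, or whitespace with a
    space, then collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Drop every whitespace-separated token found in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)


translated = "Praise be to God, my regency is amazing now."
folded = case_fold(translated)
cleaned = clean(folded)
filtered = remove_stopwords(cleaned)
print(filtered)
```

Replacing punctuation with a space, rather than deleting it, is what turns `district's` into `district s` in the third sample row.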
The dataset was then split into training and testing sets with an 80:20 ratio. The pipeline consisted of three stages:
- TF-IDF (Term Frequency-Inverse Document Frequency): Converts text into numerical features by evaluating the importance of words within the context of the entire dataset.
- SMOTEENN (Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbors): Addresses class imbalance by oversampling minority classes and removing noise from the data.
- Linear Kernel SVM (Support Vector Machine): A machine learning model that finds the optimal hyperplane for classifying data, effective for linearly separable data.
The model achieved an accuracy score of 78.24%.
The Streamlit web app consists of 4 pages: Beranda (Home), Prediksi Data (Data Prediction), Prediksi Komentar (Comment Prediction), and Dataset, each offering unique features.
The Beranda (Home) page serves as an introduction. It provides an overview of the web app on the left side and displays a pie chart of the sentiment distribution in the dataset on the right side.
The Prediksi Data (Data Prediction) page allows users to upload a new dataset. The trained model then predicts the sentiment of the uploaded data.
The Prediksi Komentar (Comment Prediction) page allows users to type in text. The trained model then predicts the sentiment of the entered text.
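A page like this could be wired up roughly as shown below. This is a hedged sketch rather than the app's actual code: `predict_sentiment` and `render_page` are hypothetical names, the widget labels are invented, and `model` stands for any fitted object with a scikit-learn style `predict` method. It also assumes the typed text is fed to the model as-is, whereas the real app presumably applies the same translation and preprocessing used in training.

```python
def predict_sentiment(model, text: str) -> str:
    """Return the predicted label for a single comment."""
    # predict() expects a sequence of documents, so wrap the text in a list.
    return str(model.predict([text])[0])


def render_page(model) -> None:
    """Draw a text box and show the prediction once the button is pressed."""
    # Imported here so predict_sentiment stays usable without Streamlit.
    import streamlit as st

    st.title("Prediksi Komentar")
    comment = st.text_area("Comment")
    if st.button("Predict") and comment.strip():
        st.write("Predicted sentiment:", predict_sentiment(model, comment))
```

Keeping the prediction logic in a plain function separate from the Streamlit widgets makes it easy to reuse the same function for the batch upload page.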
The Dataset page allows users to upload a new training dataset. Once the dataset is uploaded, the model is automatically retrained on the new data.