To avoid conflicts with other projects or system-wide Python packages, it's recommended to set up a virtual environment for this project. Here's how to do it:

- Python 3.x (ensure Python 3 is installed on your system)

- Navigate to your project directory: open a terminal or command prompt and change to the root directory of this project.
- Create a virtual environment: run `python -m venv env` to create a virtual environment named `env` (you can choose any name you prefer). This command creates a new directory `env` within your project where all dependencies will be installed.
- Activate the virtual environment:
  - On Windows, run `.\env\Scripts\activate`
  - On macOS or Linux, run `source env/bin/activate`
To run the scripts, you need to install the dependencies. Follow the steps below to set up your environment.

- Ensure Python 3.x is installed.
- Install the requirements: `pip install -r requirements.txt`
- Install PyTorch: it's recommended to use the GPU (CUDA) build of PyTorch. Visit the PyTorch "Get Started" page, select your preferences, and run the provided installation command.
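After installing, you can quickly confirm that PyTorch is present and whether CUDA acceleration is usable. This is a small diagnostic sketch, not part of the project's scripts:

```python
# Check that PyTorch is importable and whether a CUDA device is available.
try:
    import torch
    cuda_ok = torch.cuda.is_available()
    print("PyTorch:", torch.__version__, "| CUDA available:", cuda_ok)
except ImportError:
    cuda_ok = None
    print("PyTorch is not installed - see the PyTorch Get Started page.")
```

If CUDA reports `False` despite a GPU being present, the installed PyTorch build is most likely CPU-only or incompatible with your CUDA version.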
This section provides instructions for using the machine translation scripts included in this project: `translate_google.py` and `translate_mideind.py`. These scripts translate text data into Icelandic for sentiment analysis.

`translate_google.py` is a Python script for translating text data using Google's translation service. It translates reviews from the `IMDB-Dataset.csv` file located in the `Datasets` directory and saves the translated text in a new file. The script uses multithreading to enhance performance and includes error handling for translation failures.
- Python 3.x
- Pandas library
- `googletrans` version 3.1.0a0
- Other dependencies: `concurrent.futures`, `threading`, `logging`
- Ensure the `IMDB-Dataset.csv` file is located in the `Datasets` directory.
- Run the script: `python src/translate_google.py`
- The script will translate the data and output two files in the `Datasets` directory:
  - `IMDB-Dataset-GoogleTranslate.csv`: contains translated reviews and sentiments.
  - `failed-IMDB-Dataset-GoogleTranslate.csv`: logs failed translation attempts.
To use a different dataset:

- Place your CSV dataset in the `Datasets` directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the script if your dataset columns have different names.
- Modify the script's `dataset` variable to match your dataset's filename.
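The multithreading-with-error-handling pattern the script uses can be sketched as follows. This is an illustrative sketch, not the script itself: `translate_review` is a hypothetical stand-in for the actual googletrans call, and failed items are collected separately, mirroring the failed-translations CSV:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)

def translate_review(text):
    """Hypothetical stand-in for a googletrans translation call."""
    # A real implementation would send `text` to Google's service here.
    return text.upper()  # placeholder "translation"

def translate_all(reviews, max_workers=8):
    """Translate reviews concurrently, collecting failures separately."""
    translated, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(translate_review, r): i
                   for i, r in enumerate(reviews)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                translated[i] = fut.result()
            except Exception as exc:  # a failed translation is logged, not fatal
                logging.warning("Review %d failed: %s", i, exc)
                failed[i] = reviews[i]
    return translated, failed

translated, failed = translate_all(["good movie", "bad movie"])
```

Keeping the original index as the dictionary key lets the results be written back in the source order regardless of completion order.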
`translate_mideind.py` is a Python script for translating text data using the `mideind/nmt-doc-en-is-2022-10` model. It translates reviews from the `IMDB-Dataset.csv` file in the `Datasets` directory and saves the translated text in a new file.
- Python 3.x
- PyTorch
- `transformers` library
- Pandas library
- Other dependencies: `re`, `logging`
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
- Run the script: `python src/translate_mideind.py`
- The script will process the data and output two files in the `Datasets` directory:
  - `IMDB-Dataset-MideindTranslate.csv`: contains translated reviews and sentiments.
  - `failed-IMDB-Dataset-MideindTranslate.csv`: logs failed translation attempts.
To use a different dataset:

- Place your CSV dataset in the `Datasets` directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the script if your dataset columns have different names.
- Modify the script's `dataset` variable to match your dataset's filename.
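NMT models have a maximum input length, so long reviews usually need to be chunked into sentences before translation; `re` (one of the script's dependencies) is enough for a simple splitter. The following is a simplified sketch of such chunking, and the script's actual preprocessing may differ:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows
    end-of-sentence punctuation. Illustrative only."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Great film. Would watch again! Truly.")
```

Each chunk can then be translated independently and the results joined back together.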
This section provides instructions for using the `process.py` script, which performs text normalization and preprocessing for Icelandic text using IceNLP.

- Python 3.x
- Pandas library
- IceNLP tool (https://github.com/hrafnl/icenlp)
- Other dependencies: `multiprocessing`, `os`, `string`, `sys`, `time`, `tkinter`, `re`, `joblib`, `nefnir`
- Download IceNLP from the IceNLP GitHub repository and extract it.
- Run the script: `python src/process.py`
- When prompted, select the `icetagger.bat` file located in the extracted IceNLP directory (`IceNLP-1.5.0\IceNLP\bat\icetagger`).
- Ensure the dataset file (`IMDB-Dataset-MideindTranslate.csv`) is located in the `Datasets` directory relative to the script.
- The script will process the dataset and output the processed data to `Datasets/IMDB-Dataset-MideindTranslate-processed-nefnir.csv`.
To use a different dataset:

- Place your CSV dataset in the `Datasets` directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the `dataset_path` variable in the script to match your dataset's filename.
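The basic normalization step can be illustrated with the `string` module (another of the script's dependencies). This is a simplified stand-in only: the real pipeline additionally runs IceNLP tagging and nefnir lemmatization, which are not reproduced here:

```python
import string

def normalize(text):
    """Lowercase the text and strip ASCII punctuation - a simplified
    sketch of the normalization step, not the full IceNLP pipeline."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

clean = normalize("Þetta er FRÁBÆR mynd!!!")
```

Note that `str.lower()` handles Icelandic characters such as Þ and Á correctly, since Python strings are Unicode.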
This section provides instructions for using the `process_eng.py` script, which performs text normalization and preprocessing for English text.

- Python 3.x
- Pandas library
- NLTK library
- Other dependencies: `os`, `time`, `re`, `joblib`
- Download the necessary NLTK data: `python -m nltk.downloader punkt stopwords wordnet`
- Ensure the dataset file (`IMDB-Dataset.csv`) is located in the `Datasets` directory.
- Run the script: `python src/process_eng.py`
- The script will process the dataset and output the processed data to `Datasets/IMDB-Dataset-Processed.csv`.
To use a different dataset:

- Place your dataset in the `Datasets` directory.
- The dataset should be in CSV format with a 'review' column.
- Modify the `dataset_path` variable in the script to match your dataset's filename.
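The kind of preprocessing this script performs can be sketched as follows. This is a simplified illustration: the tiny hardcoded stopword set is a placeholder for NLTK's full English list, and the real script also tokenizes with punkt and lemmatizes with WordNet:

```python
import re

# A few sample stopwords; the script itself uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "was", "and"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords - a simplified
    sketch of English preprocessing (no punkt tokenization or WordNet
    lemmatization here)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

out = preprocess("The movie WAS surprisingly good, and fun!")
```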
This section provides instructions for using the `BaselineClassifiersBinary.ipynb` notebook, which trains SVC, Logistic Regression, and Naive Bayes classifiers on the English, Icelandic Google, and Icelandic Miðeind datasets, and generates classification reports for each model.

- Python 3.x
- PyTorch
- Pandas library
- Scikit-learn library
- Other dependencies: `os`, `time`, `numpy`
Open `BaselineClassifiersBinary.ipynb` and run the cells. You have to change the `ICELANDIC_GOOGLE_CSV`, `ICELANDIC_MIDEIND_CSV`, and `ENGLISH_CSV` variables to point to the correct datasets. The cell will train the models and print out the classification reports for each one, along with a diagram. You can refer to the next cell if you want to print out the most important features, although this is not necessary.
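The shape of such a baseline can be sketched with scikit-learn. This is an illustrative toy example, not the notebook's code: the four repeated reviews stand in for the CSV datasets, and only one of the three classifiers (Logistic Regression) is shown:

```python
# One baseline: TF-IDF features + Logistic Regression, with a
# classification report, on toy stand-in data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

reviews = ["great movie", "loved it", "terrible film", "awful plot"] * 5
labels = ["positive", "positive", "negative", "negative"] * 5

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)
preds = model.predict(reviews)
print(classification_report(labels, preds))
```

Swapping `LogisticRegression()` for `LinearSVC()` or `MultinomialNB()` gives the other two baselines with the same pipeline structure.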
This section provides instructions for using the `train.py` script, which trains a transformer model for sentiment analysis.

- Python 3.x
- Transformers library
- PyTorch
- Pandas library
- Scikit-learn library
- Other dependencies: `os`, `time`, `numpy`
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
- Place the dataset file (default: `IMDB-Dataset-GoogleTranslate.csv`) in the `Datasets` directory relative to the script.
- Modify the script if you want to use a different pre-trained model or dataset.
- Run the script: `python src/train.py`
- The script will train the model using the specified dataset and save the trained model and tokenizer in the `Models` directory.
To use a different dataset:

- Place your dataset in the `Datasets` directory.
- The dataset should be in CSV format with 'review' and 'sentiment' columns.
- Modify the `dataset_path` variable in the script to match your dataset's filename.
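The data preparation a training script like this typically performs can be sketched with pandas: map the sentiment strings to integer labels and hold out a test split. The toy rows below are illustrative; the actual script reads the CSV from `Datasets` and its exact split may differ:

```python
import pandas as pd

# Toy stand-in for the 'review'/'sentiment' CSV described above.
df = pd.DataFrame({
    "review": ["frábær mynd", "hræðileg mynd", "góð mynd", "leiðinleg mynd"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Transformer trainers expect integer class labels.
df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})

# Hold out 25% of the rows for evaluation.
train_df = df.sample(frac=0.75, random_state=42)
test_df = df.drop(train_df.index)
```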
This section provides instructions for using the `generate_report.ipynb` notebook, which generates a classification report for a trained model. A pre-trained transformer model is available on Hugging Face: https://huggingface.co/Birkir/electra-base-igc-is-sentiment-analysis-google-translate

This is useful mostly for the transformer models, as the baseline classifiers generate their own reports via the same libraries.

The `call_model` function calls the model and generates a classification report for it. It expects the path to the model folder, the device to use, the pandas columns to use as X and y, and whether to return the accuracy or the classification report.
- Import `generate_classification_report.py`: `import generate_classification_report as gcr`
- Load the CSV file with the data to be tested: `df = pd.read_csv('IMDB-Dataset-GoogleTranslate.csv')`
- Invoke the `call_model` function, which takes the following parameters:
  - `X_all`: all review columns
  - `y_all`: all sentiment columns
  - `model`: the model to be used - either a local path (e.g. `./electra-base-google-batch8-remove-noise-model/`) or the Hugging Face path `Birkir/electra-base-igc-is-sentiment-analysis-google-translate`
  - `device`: the device to be used (CUDA or CPU)
  - `accuracy`: whether to return accuracy or a classification report

An example of how to generate a report can be seen in `generate_report.ipynb`; see also the `eval_files()` function in `generate_classification_report.py`, which loads and evaluates multiple models.
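Since the project's other reports come from scikit-learn, the output of such a report presumably looks like sklearn's standard one. A minimal illustration with dummy predictions (the notebook obtains its predictions from the transformer model instead):

```python
from sklearn.metrics import accuracy_score, classification_report

# Dummy ground truth and predictions, standing in for model output.
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive"]

report = classification_report(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(report)
```

This mirrors the `accuracy` flag described above: the same predictions yield either a single accuracy number or the full per-class report.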
https://github.com/olafurjohannsson/sentiment-analysis/tree/main
https://huggingface.co/Birkir/electra-base-igc-is-sentiment-analysis-google-translate
MIT
Ólafur Aron Jóhannsson
Eysteinn Örn
Birkir Arndal