This project is home to a naive bayesian classifier implemented from scratch in go. The dataset this program was designed for is comprised of 12 binary features followed by binary class label, and each feature corresponds to indicators/attributes collected from spam and legitimate emails.
Baye's Theorem states that the probability of a hypothesis H conditional on a given body of data E is the ratio of the unconditional probability of the conjunction of the hypothesis with the data to the unconditional probability of the data alone.
Baye's theorem is defined as the probability of H conditional on E is defined as PE(H) = P(H & E)/P(E), provided that both terms of this ratio exist and P(E) > 0.
Baye's theorem is used to calculate the conditional probability of an input belonging to a specific class based on prior knowledge. In this case, the program takes two command-line arguments to a labelled and unlabelled dataset. First, it applies the knowledge gained by the model's during the training phase from the labelled dataset to determine the conditional probability of a given feature belonging to the spam or non-spam classes. It then uses this probability to label the unlabelled set, before displaying its findings.
This project was mainly completed as a way to learn and practice go, it was not intended for practical or varied use; some functions are designed around the specific datasets.
Tested on Mac and Linux.
Five Step Process
- Separate Data By Classes.
- Summarize Dataset.
- Summarize Data By Classes.
- Run Gaussian Probability Density Function.
- Estimate Class Probabilities.
Build the program from source using
go build main.go
To run the program, first make sure it is marked as executable with
chmod +x main
Then run the program with
./main <path-to>spamLabelled.dat <path-to>spamUnlabelled.dat
Feel free to fork or make contributions. Any feedback is always welcome!