- Analysis of Benign and DDoS Attack dataflows, and machine learning models to detect malicious DDoS activity.
- Collect and clean data, normalize columns, drop NA and fix infinite values.
- Separate datasets by protocol and attack, then recombine into balanced and anomaly detection datasets.
- Exploratory data analysis & visualizations.
- Prediction models to detect DDoS attacks.
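
The cleaning step listed above could look roughly like the sketch below. It is a minimal illustration, not the code in Data_Cleaning_Notebook.ipynb: the function name `clean_flows`, the snake_case naming scheme, and the choice to treat infinite values as missing and drop them are all assumptions.

```python
import numpy as np
import pandas as pd


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalize column names, fix infinities, drop NA."""
    out = df.copy()
    # Normalize column names: trim whitespace, lowercase, snake_case.
    out.columns = [
        c.strip().lower().replace("/", "_per_").replace(" ", "_")
        for c in out.columns
    ]
    # Assumption: infinite rates (e.g. bytes/s of a zero-length flow) are
    # treated as missing values and dropped with the other NA rows.
    out = out.replace([np.inf, -np.inf], np.nan)
    return out.dropna().reset_index(drop=True)


# Toy input with an untidy header, one infinite rate and one missing label.
raw = pd.DataFrame({
    " Flow Duration": [10, 20, 30],
    "Flow Bytes/s": [1.5, np.inf, 3.0],
    " Label": ["BENIGN", "Syn", None],
})
clean = clean_flows(raw)
print(clean)
```

Only the first toy row survives, because the second has an infinite rate and the third has no label.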
- Started with 19 datasets, which were combined into 11 DDoS datasets and 1 Benign dataset covering over 70 common websites/services.
- 11 types of DDoS attacks: DNS, LDAP, MSSQL, NTP, NetBIOS, Portmap, SNMP, SSDP, Syn, TFTP and UDPLag.
- Over 500,000,000 datapoints combined, each representing one data flow.
- Target variables:
  - Malicious (Binary Classification).
  - Label (Multiclass Classification).
- Over 80 features:
  - Protocol: TCP, UDP or HOPOPT.
  - Flow duration, down/up ratio.
  - Time active and idle min, max, mean and std.
  - Total forward & backward packets and size.
  - Forward and backward packet length min, max, mean and std.
  - Forward and backward header size.
  - Forward and backward bytes/sec and packets/sec rates.
  - Forward and backward interarrival time min, max, mean and std.
  - Forward and backward PSH and URG flags, total FIN, SYN, RST, ACK, CWE and ECE flags.
  - Forward and backward subflow packets, size.
  - Forward and backward initial window size.
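
To make the min, max, mean and std features above concrete, here is a small sketch of how interarrival-time statistics could be derived from the packet timestamps of one flow. The function name `interarrival_stats` is hypothetical, and the use of the population standard deviation is an assumption; CICFlowMeter's exact definitions may differ.

```python
from statistics import mean, pstdev


def interarrival_stats(timestamps):
    """Summarize the gaps between consecutive packet timestamps (seconds)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not gaps:
        # A single-packet flow has no interarrival times.
        return {"min": 0.0, "max": 0.0, "mean": 0.0, "std": 0.0}
    return {
        "min": min(gaps),
        "max": max(gaps),
        "mean": mean(gaps),
        "std": pstdev(gaps),  # assumption: population std, not sample std
    }


# Four packets arriving at t = 0, 1, 3 and 6 seconds give gaps of 1, 2 and 3.
stats = interarrival_stats([0.0, 1.0, 3.0, 6.0])
print(stats)
```

The same pattern applies to packet lengths and to active/idle times: collect the per-flow values, then reduce them to four summary numbers.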
- Binary Classification in a balanced dataset: 50% Benign, 50% Malicious.
- Multiclass Classification in a balanced dataset: 50% Benign, 50% divided into 11 DDoS attacks.
- Binary Classification in an anomaly detection dataset: 99% Benign, 1% Malicious.
- Multiclass Classification in an anomaly detection dataset: 99% Benign, 1% divided into 11 DDoS attacks.
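
The balanced and anomaly detection mixes described above could be assembled along these lines. This is a sketch under assumptions: `make_mix`, the lowercase `label` and `malicious` column names, and the toy frames are illustrative and are not taken from the project's functions.py.

```python
import pandas as pd


def make_mix(benign, attacks, malicious_fraction, n_total, seed=0):
    """Hypothetical helper: sample benign and attack flows into one dataset.

    The malicious share is split evenly across the attack types; integer
    division can make the result slightly smaller than n_total.
    """
    n_malicious = int(round(n_total * malicious_fraction))
    n_benign = n_total - n_malicious
    per_attack = n_malicious // len(attacks)
    parts = [benign.sample(n=n_benign, random_state=seed)]
    for frame in attacks.values():
        parts.append(frame.sample(n=per_attack, random_state=seed))
    mixed = pd.concat(parts, ignore_index=True)
    # Binary target derived from the multiclass label.
    mixed["malicious"] = (mixed["label"] != "BENIGN").astype(int)
    # Shuffle so the classes are not grouped in blocks.
    return mixed.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy stand-ins for the real per-attack CSVs.
benign = pd.DataFrame({"flow_duration": range(1000), "label": "BENIGN"})
attacks = {
    "Syn": pd.DataFrame({"flow_duration": range(100), "label": "Syn"}),
    "UDPLag": pd.DataFrame({"flow_duration": range(100), "label": "UDPLag"}),
}

balanced = make_mix(benign, attacks, malicious_fraction=0.5, n_total=200)
anomaly = make_mix(benign, attacks, malicious_fraction=0.01, n_total=1000)
print(balanced["malicious"].mean(), anomaly["malicious"].mean())
```

Keeping both targets (`label` and `malicious`) in the same frame lets the binary and multiclass experiments share one dataset per mix.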
- Binary Classification in a balanced dataset:
  - Dummy Classifier: Acc .512, F1 .513
  - Naive Bayes: Acc .822, F1 .844
  - Decision Tree: Acc .999, F1 .999
  - K Nearest Neighbors: Acc .997, F1 .997
  - Random Forest: Acc .999, F1 .999
  - XGBoost: Acc 1.0, F1 1.0
- Multiclass Classification in a balanced dataset:
  - Dummy Classifier: Acc .000, F1 .273
  - Naive Bayes: Acc .662, F1 .844
  - Decision Tree: Acc .924, F1 .933
  - K Nearest Neighbors: Acc .920, F1 .926
  - Random Forest: Acc .928, F1 .937
  - XGBoost: Acc .929, F1 .938
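
The results above report both accuracy and F1. As a reminder of why the two can diverge, especially on a 99% Benign dataset, the sketch below computes both by hand for the binary case. The toy labels are made up for illustration, and the averaging used for the multiclass F1 scores above is not stated in this README, so it is not reproduced here.

```python
def accuracy(y_true, y_pred):
    """Fraction of flows whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy anomaly-style sample: 99 benign flows (0) and 1 malicious flow (1).
y_true = [0] * 99 + [1]
always_benign = [0] * 100                # never raises an alarm
with_false_alarm = [1] + [0] * 98 + [1]  # catches the attack, one false alarm

print(accuracy(y_true, always_benign), f1_binary(y_true, always_benign))
print(accuracy(y_true, with_false_alarm), f1_binary(y_true, with_false_alarm))
```

Both predictors reach the same accuracy, but only F1 separates the one that never detects an attack from the one that does.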
- Simple models such as Decision Tree work very well (Acc .999 for binary and .924 for multiclass classification on the balanced dataset).
- Given so much information about a dataflow (80+ features), it is easy to detect DDoS Attacks.
- It is hard to monitor that much information in real time, so the following attributes should be monitored:
  - Min, max and mean packet size of a dataflow.
  - Mean header size.
  - Protocol.
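
A reduced-feature model restricted to the attributes recommended above might be trained as in the following sketch. Everything here is an assumption made for illustration: the column names, the `MONITORED` list, and above all the synthetic data, whose class separation is arbitrary and says nothing about real traffic. Protocol is encoded with IANA protocol numbers (0 = HOPOPT, 6 = TCP, 17 = UDP).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical column names for the recommended attributes.
MONITORED = [
    "packet_length_min",
    "packet_length_max",
    "packet_length_mean",
    "header_length_mean",
    "protocol",
]

# Synthetic flows, generated only so the example runs end to end.
rng = np.random.default_rng(0)
n = 400
malicious = rng.integers(0, 2, size=n)
flows = pd.DataFrame({
    "packet_length_min": rng.normal(60, 5, n) + 300 * malicious,
    "packet_length_max": rng.normal(900, 50, n) + 300 * malicious,
    "packet_length_mean": rng.normal(400, 30, n) + 300 * malicious,
    "header_length_mean": rng.normal(32, 2, n),
    "protocol": rng.choice([0, 6, 17], size=n),
    "flow_duration": rng.normal(1000, 100, n),  # present but not monitored
    "malicious": malicious,
})

# Train only on the monitored subset of columns.
X = flows[MONITORED]
y = flows["malicious"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
score = f1_score(y_test, model.predict(X_test))
print(f"F1 on held-out synthetic flows: {score:.3f}")
```

On real data, comparing this score against the full 80+ feature model would show how much detection quality the smaller monitoring footprint costs.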
- Collect more benign data to build a dataset with only .01% malicious dataflows.
- Test trained model on real-time data.
- Build front-end GUI.
- DDoS data collected from the Canadian Institute for Cybersecurity.
- Benign data collected from Kaggle.
- All data originally created with CICFlowMeter.
- Images obtained from WP DIY.
- DDoS/Cybersecurity statistics and info obtained from Cybercrime Magazine, Cloudflare & Wikipedia.
- Data_Cleaning_Notebook.ipynb: Notebook used to drop NA values, normalize column names, and separate data into CSVs by attack and protocol.
- EDA_Notebook.ipynb: Notebook used for exploratory data analysis and the creation of visualizations.
- Modeling_Notebook.ipynb: Notebook used to run a dummy classifier and various machine learning models to classify DDoS attacks.
- Directory containing model scores and feature importances.
- Directory containing all data visualizations.
- functions.py: Functions used for data cleaning, reading in datasets, etc.
- README.md: File explaining the project.