Web Crawler Detection

This work is the final project of the Rahnema College Machine Learning internship program. The aim of this project is to train an unsupervised model that recognizes whether the web requests sent to the Sanjagh website come from crawlers or not. Finally, to put this work to use, we developed a web application for the production phase. Reach the webpage by clicking this link: Demo

We highly recommend viewing the presentation slides here to get a better intuition of what we have done.

Dataset

The dataset has been obtained from the Sanjagh server logs. Although it cannot be publicly released, a tiny sample of it can be found in output.log. In case you are interested in a complete dataset, you can use any other publicly available nginx server logs on the internet.

A sample record structure is as follows:

207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] 304 0 [[Googlebot-Image/1.0]] 32
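For illustration, here is a minimal sketch of parsing a record like the one above with a regular expression. The field interpretation (in particular, reading the two trailing numbers as response length and response time) is our assumption, not something documented in the repository:

```python
import re

# Illustrative parser for one log record; field names are assumptions.
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<method>\S+) (?P<path>[^\]]+)\] "
    r"(?P<status>\d+) (?P<response_length>\d+) "
    r"\[\[(?P<user_agent>.*?)\]\] "
    r"(?P<response_time>\d+)"
)

line = "207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] 304 0 [[Googlebot-Image/1.0]] 32"
record = LOG_PATTERN.match(line).groupdict()
print(record)
```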

Project phases

| Phase | Description |
| --- | --- |
| EDA | Exploratory Data Analysis and Feature Engineering. |
| Baseline Models | Train some common baseline clustering models for anomaly detection. |
| Advanced Models | Autoencoders are used! |
| Demo | A simple demo webpage was developed. |

EDA

In this phase we got to know the data better. We explored the dataset for useful information and tried to extract appropriate clues from it. We highly recommend running 01_sanjaghDatasetEDA.ipynb in notebooks/ to see exactly what we did in this part.

Then we generated a set of features per session. These features can be modified in my_utils.py in utils/. Here is the list of features we used (a sketch of how they can be computed follows the table):

| Feature (per session) | Description |
| --- | --- |
| Click rate | A high click rate can only be achieved by an automated script. |
| STD of path depth | Deeper requests usually indicate a human user. |
| Percentage of 4xx status codes | Usually higher for crawlers, as there is a higher chance of hitting outdated or deleted pages. |
| Percentage of 3xx status codes | Indicates redirected requests. |
| Percentage of HTTP HEAD requests | Usually higher for crawlers, since human browsers rarely issue HEAD requests. |
| Percentage of image requests | Web crawlers usually ignore images. |
| Average & sum of response_length & response_time | Human users retrieve info via a browser, which automatically requests additional resources within the session. |
| User agent attributes | Browser, OS, is_bot, is_pc. |
| Average time between requests | Usually higher for human users. |
| Number of robots.txt requests | Crawlers want to know the limitations! |
| Percentage of consecutive repeated requests | Usually higher for automated scripts that repeatedly request the same path. |
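To make this concrete, below is a minimal sketch of computing a few of these per-session features with pandas. The column names (session_id, timestamp, path, status) and the helper itself are illustrative assumptions, not the actual implementation in my_utils.py:

```python
import pandas as pd

def build_session_features(logs: pd.DataFrame) -> pd.DataFrame:
    """logs: one row per request with columns
    ['session_id', 'timestamp', 'path', 'status'] (timestamp as datetime, status as int)."""
    logs = logs.sort_values(["session_id", "timestamp"])
    grouped = logs.groupby("session_id")

    # Session duration in seconds (+1 to avoid division by zero for single-request sessions).
    duration = grouped["timestamp"].agg(lambda t: (t.max() - t.min()).total_seconds() + 1)

    features = pd.DataFrame({
        # Click rate: requests per second of session duration.
        "click_rate": grouped.size() / duration,
        # STD of path depth: number of '/' segments per requested path.
        "path_depth_std": grouped["path"].agg(lambda p: p.str.count("/").std()),
        # Percentage of 4xx status codes.
        "pct_4xx": grouped["status"].agg(lambda s: s.between(400, 499).mean()),
        # Percentage of image requests.
        "pct_images": grouped["path"].agg(
            lambda p: p.str.lower().str.contains(r"\.(?:png|jpe?g|gif|svg)$").mean()
        ),
        # Number of robots.txt requests.
        "robots_txt_hits": grouped["path"].agg(
            lambda p: p.str.contains("robots.txt", regex=False).sum()
        ),
        # Average time between consecutive requests (seconds).
        "avg_inter_request_time": grouped["timestamp"].agg(
            lambda t: t.diff().dt.total_seconds().mean()
        ),
    })
    return features.fillna(0)
```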

Baseline models

The baseline models we decided to use are IsolationForest and LocalOutlierFactor.

IsolationForest

Isolation Forest is a technique for detecting outlier samples. Since outliers have feature values that differ significantly from most samples, they are isolated earlier in the hierarchy of a decision tree. Outliers are detected by setting a threshold on the mean path length (number of splits) from the root of the tree downwards. The scikit-learn implementation provides a score for each sample that increases from -1 to +1 with the number of splits; samples with lower scores are likely to be outliers. The outlier threshold on the score must be set by the user.
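For reference, here is a minimal scikit-learn sketch of fitting IsolationForest on the per-session feature matrix; the contamination value is an assumed placeholder to be tuned per dataset:

```python
from sklearn.ensemble import IsolationForest

# features: the per-session feature matrix built earlier (rows = sessions).
iso_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = iso_forest.fit_predict(features)    # +1 = inlier, -1 = outlier
scores = iso_forest.score_samples(features)  # lower scores => more anomalous
```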

Then, for better visualization, we applied PCA with 3 components to see how outliers and inliers are separated.
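Continuing the sketch above, the visualization step could look like this (plotting details are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the session features down to 3 components for visualization.
pca = PCA(n_components=3)
reduced = pca.fit_transform(features)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Color by the IsolationForest labels: -1 = outlier, +1 = inlier.
ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2], c=labels, cmap="coolwarm", s=5)
plt.show()
```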

LocalOutlierFactor

The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
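A corresponding scikit-learn sketch for LocalOutlierFactor; n_neighbors and contamination are assumed placeholder values:

```python
from sklearn.neighbors import LocalOutlierFactor

# features: the per-session feature matrix (rows = sessions).
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(features)          # +1 = inlier, -1 = outlier
lof_scores = lof.negative_outlier_factor_   # lower => more anomalous
```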

However, the results of IsolationForest were more accurate.

Advanced Models

One of the most promising methods for unsupervised tasks is the autoencoder, which gave us our best results.

Autoencoders

An autoencoder is a neural network architecture capable of discovering structure within data in order to develop a compressed representation of the input. Applications include anomaly detection, data denoising, and more.

The training configuration we used is as follows. It can be modified in configs.py, which can be found in configs/.

| Setting | Value |
| --- | --- |
| Optimizer | Adam |
| Loss | MSE |
| Activations | ReLU |
| # of epochs | 20 |
| Batch size | 64 |

Additionally, multiple architectures have been tested; their comparison is shown below:

| # of neurons | Train loss | Test loss |
| --- | --- | --- |
| [15, 7, 15] | 0.42 | 0.48 |
| [15, 3, 15] | 0.28 | 0.39 |
| [15, 7, 3, 7, 15] | 0.29 | 0.43 |
| [15, 7, 7, 7, 15] | 0.31 | 0.42 |
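For illustration, here is a minimal Keras sketch of the best-performing configuration above ([15, 3, 15] with the training settings from the table). The input dimension, placeholder data, and variable names are assumptions, not the exact code in this repository:

```python
import numpy as np
from tensorflow.keras import layers, models

n_features = 15  # assumed to match the 15-unit outer layers above
X_train = np.random.rand(1024, n_features).astype("float32")  # placeholder for the real feature matrix

# Encoder/decoder with a 3-unit bottleneck, as in the [15, 3, 15] row above.
autoencoder = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(15, activation="relu"),
    layers.Dense(3, activation="relu"),   # bottleneck
    layers.Dense(15, activation="relu"),
    layers.Dense(n_features),             # linear reconstruction layer
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train the network to reconstruct its own input.
autoencoder.fit(X_train, X_train, epochs=20, batch_size=64, validation_split=0.1)
```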

As with all unsupervised algorithms, an anomaly score threshold must be selected. After many experiments, the MSE threshold that fits our dataset best is 0.26, but you can modify it for your own dataset in configs.py, which can be found in configs/. We also evaluated our models; the results can be checked in the presentation slides in presentation/. A sketch of applying such a threshold is shown below.
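This sketch reuses the autoencoder from above; X_new is an illustrative name for the per-session feature matrix to score:

```python
import numpy as np

MSE_THRESHOLD = 0.26  # value reported above; adjust it in configs.py for your own data

# Reconstruct the sessions and compute the per-session reconstruction error.
reconstructed = autoencoder.predict(X_new)
mse_per_session = np.mean(np.square(X_new - reconstructed), axis=1)

# Sessions whose reconstruction error exceeds the threshold are flagged as crawlers.
is_crawler = mse_per_session > MSE_THRESHOLD
```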

Demo

We used Flask and React to develop a webpage for our project. First, check it out here. If you are interested in running the webpage locally, follow the steps below:

How to run

The whole source code of the webpage can be found in App/. Run the commands below:

  1. Clone the repo:
$ git clone https://github.com/mohammadhashemii/Web-Crawler-Detection
$ cd Web-Crawler-Detection
  2. Run the backend:
$ cd backend
$ pip install flask flask-cors
$ python app.py
  3. Run the frontend:
$ cd frontend
$ npm install
$ npm start