Malware detection using performance counters' data and deep learning time series classification methods. (POC)

omarmuhamed/Malware-Detection-With-Performance-Counters

Malware Detection With Performance Counters (Experiment, POC)


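In this experiment I tested whether malware can be detected using deep learning time series classifiers trained on the values of a process's performance counters. Conducting the experiment takes three things:

  • A tool that extracts performance counter values from a process (perfextract)
  • Data processing that turns the extracted values into a dataset (utils)
  • Classification models trained on that dataset (classifiers)

So let's start with the performance counter extractor tool.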

Performance Counter Extractor (perfextract)

The tool extracts performance counter values either by starting a new process or by attaching to a running one. All counters are sampled simultaneously for 30 seconds at a 0.5 second interval, which gives 60 samples per counter. The tool can also track all child processes and extract their data. The experiment uses the 23 counters listed below:

| Counter | Description |
|---|---|
| % Privileged Time | Shows the percentage of non-idle processor time spent executing code in privileged mode. |
| Handle Count | Shows the total number of handles currently open by this process. This number is equal to the sum of the handles currently open by each thread in this process. |
| IO Read Operations/sec | Shows the rate, in incidents per second, at which the process was issuing read I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Data Operations/sec | Shows the rate, in incidents per second, at which the process was issuing read and write I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Write Operations/sec | Shows the rate, in incidents per second, at which the process was issuing write I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Other Operations/sec | Shows the rate, in incidents per second, at which the process was issuing I/O operations that were neither read nor write operations (for example, a control function). It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Read Bytes/sec | Shows the rate, in bytes per second, at which the process was reading bytes from I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Write Bytes/sec | Shows the rate, in bytes per second, at which the process was writing bytes to I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Data Bytes/sec | Shows the rate, in bytes per second, at which the process was reading and writing bytes in I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| IO Other Bytes/sec | Shows the rate, in bytes per second, at which the process was issuing bytes to I/O operations that do not involve data, such as control operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
| Page Faults/sec | Shows the rate, in incidents per second, at which page faults were handled by the processor. A page fault occurs when a process requires code or data that is not in its working set. This counter includes both hard faults (those that require disk access) and soft faults (those where the faulted page is found elsewhere in physical memory). |
| Page File Bytes Peak | Shows the maximum amount of virtual memory, in bytes, that a process has reserved for use in the paging file(s). Paging files are used to store pages of memory used by the process. |
| Page File Bytes | Shows the current amount of virtual memory, in bytes, that a process has reserved for use in the paging file(s). Paging files are used to store pages of memory used by the process. |
| Pool Paged Bytes | Shows the number of bytes in the paged pool, an area of system memory (physical memory used by the operating system) for objects that can be written to disk. |
| Pool Nonpaged Bytes | Shows the number of bytes in the nonpaged pool, an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk. |
| Private Bytes | Shows the size, in bytes, of the memory that this process has allocated and that cannot be shared with other processes. |
| Priority Base | Shows the current base priority of this process. Threads within a process can raise and lower their own base priority relative to the process's base priority. |
| Thread Count | Shows the number of threads that were active in this process. |
| Virtual Bytes Peak | Shows the maximum size, in bytes, of virtual address space that the process has used at any one time. |
| Virtual Bytes | Shows the size, in bytes, of the virtual address space that the process is using. |
| Working Set Peak | Shows the maximum size, in bytes, of the working set of this process. The working set is the set of memory pages that were touched recently by the threads in the process. |
| Working Set | Shows the size, in bytes, of the working set of this process. |
| Working Set - Private | Shows the subset of the working set that cannot be shared with other processes. |
  • Usage

perfextract (-f <path to exe file> | -p <pid of specific running program>) [options] [-o <path to output folder>]

-c      Track child processes
  • Output

  • The tool outputs a CSV file containing the extracted data
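
When many samples have to be collected, it can help to assemble the command line from a script. The following is a minimal sketch that only builds the argument list from the usage line above. The function name `build_perfextract_command`, its parameters and the paths are hypothetical, and it assumes that `perfextract` is reachable on the PATH.

```python
from typing import List, Optional


def build_perfextract_command(
    exe_path: Optional[str] = None,
    pid: Optional[int] = None,
    track_children: bool = False,
    output_dir: Optional[str] = None,
    tool: str = "perfextract",  # assumed to be on the PATH
) -> List[str]:
    """Assemble the argument list described by the usage line."""
    # The usage line requires exactly one of -f or -p.
    if (exe_path is None) == (pid is None):
        raise ValueError("pass exactly one of exe_path (-f) or pid (-p)")
    cmd = [tool]
    cmd += ["-f", exe_path] if exe_path is not None else ["-p", str(pid)]
    if track_children:
        cmd.append("-c")  # track child processes
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd


# Hypothetical paths, for illustration only.
command = build_perfextract_command(
    exe_path="sample.exe", track_children=True, output_dir="out"
)
# To actually launch the tool: subprocess.run(command, check=True)
```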

Data Processing (utils)

Data processing is done with pandas, numpy and sklearn. First, the CSV files are loaded as data frames using pandas and converted to a 3-dimensional numpy array. Each sample has shape 60x23 (60 time steps by 23 counters), so the dataset has shape Nx60x23, where N is the number of samples. After loading, the data is normalized using either min-max normalization or Z-score normalization from sklearn. The last step splits the dataset into train, validation and test sets, also using sklearn.
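
The steps above can be sketched roughly as follows. This is not the repository's code. It uses numpy only, with a hand-written Z-score and a random index split standing in for the sklearn utilities, and synthetic arrays standing in for the CSV files. The helper names and the split fractions are assumptions.

```python
import numpy as np


def zscore_normalize(x, eps=1e-8):
    """Z-score each counter using statistics over all samples and time steps.

    x has shape (N, T, F): samples, time steps, counters.
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / (std + eps)  # eps guards against constant counters


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle sample indices and cut them into train, validation and test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


# Synthetic stand-in for 50 loaded CSV files, each 60 time steps x 23 counters.
rng = np.random.default_rng(0)
samples = [rng.random((60, 23)) for _ in range(50)]
X = np.stack(samples)             # shape (50, 60, 23), i.e. N x 60 x 23
y = rng.integers(0, 2, size=50)   # hypothetical labels: 0 benign, 1 malware

X_norm = zscore_normalize(X)
train_idx, val_idx, test_idx = split_indices(len(X))
X_train, y_train = X_norm[train_idx], y[train_idx]
```

This follows the order described above (normalize, then split). Computing the normalization statistics on the training split only is a common alternative that keeps test data out of the statistics.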

Classification Models (classifiers)

Fully Convolutional Neural Network (FCN)

FCNs are convolutional networks that contain no local pooling layers, so the length of the time series stays unchanged throughout the convolutions. Another main characteristic of this architecture is that the traditional final FC layer is replaced with a Global Average Pooling (GAP) layer, which drastically reduces the number of parameters in the network and enables the use of Class Activation Maps (CAM), which highlight the parts of the input time series that contributed most to a given classification.
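
To make the parameter saving concrete, here is a small sketch that compares a classifier placed on a flattened feature map with one placed after GAP. The 128 output channels and the 2 classes are assumed values chosen for illustration. Only the 60 time steps come from the dataset described above.

```python
def flatten_head_params(time_steps, channels, n_classes):
    """Dense classifier on a flattened (time_steps x channels) feature map."""
    return time_steps * channels * n_classes + n_classes  # weights + biases


def gap_head_params(channels, n_classes):
    """Dense classifier after GAP. The pooling itself has no weights."""
    return channels * n_classes + n_classes


def global_average_pool(feature_map):
    """Average each channel over time. feature_map is a T x C nested list."""
    steps = len(feature_map)
    channels = len(feature_map[0])
    return [sum(step[c] for step in feature_map) / steps for c in range(channels)]


# Assumed sizes: 60 time steps, 128 channels after the last convolution, 2 classes.
with_flatten = flatten_head_params(60, 128, 2)
with_gap = gap_head_params(128, 2)
pooled = global_average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Under these assumptions the flattened head has 15362 parameters and the GAP head has 258.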

Encoder

Encoder is a hybrid deep CNN whose architecture is inspired by FCN with a main difference where the GAP layer is replaced with an attention layer. Similarly to FCN, the first three layers are convolutional with some relatively small modifications. The first convolution is composed of 128 filters of length 5; the second convolution is composed of 256 filters of length 11; the third convolution is composed of 512 filters of length 21. Each convolution is followed by an instance normalization operation whose output is fed to the PReLU activation function. The output of PReLU is followed by a dropout operation and a final max pooling of length 2. The third convolutional layer is fed to an attention mechanism that enables the network to learn which parts of the time series are important for a certain classification. Finally, a traditional softmax classifier is fully connected to the latter layer with a number of neurons equal to the number of classes in the dataset.
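
The idea of weighting time steps by importance can be illustrated with a generic softmax attention over the time axis. This is a sketch of the general technique, not the exact attention layer used in Encoder. In a trained network the scores would come from learned layers, while here they are passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weight each time step by softmax(scores) and sum over time.

    features: T x C nested list (time steps by channels)
    scores:   T raw importance scores, one per time step
    """
    weights = softmax(scores)
    channels = len(features[0])
    pooled = [
        sum(w * step[c] for w, step in zip(weights, features))
        for c in range(channels)
    ]
    return weights, pooled


# Equal scores reduce to a plain average over time.
weights, pooled = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

The returned weights sum to 1, so they can be read as how much each time step contributes to the pooled representation.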

Multi Layer Perceptron (MLP)

The network contains 4 layers in total where each one is fully connected to the output of its previous layer. The final layer is a softmax classifier, which is fully connected to its previous layer’s output and contains a number of neurons equal to the number of classes in a dataset. All three hidden FC layers are composed of 500 neurons with ReLU as the activation function. Each layer is preceded by a dropout operation.
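
As a rough size check, the number of trainable parameters of such an MLP can be counted by hand. The sketch below assumes that each 60x23 sample is flattened into 1380 inputs and that there are 2 classes. Neither assumption is stated in the description above.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(n_inputs, hidden=(500, 500, 500), n_classes=2):
    """Total parameters of input -> three hidden layers -> softmax output."""
    sizes = [n_inputs, *hidden, n_classes]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


# Assumption: the 60 x 23 sample is flattened before the first layer.
total_params = mlp_params(60 * 23)
```

With these assumptions the first layer alone accounts for 690500 of the 1192502 parameters, since it connects every input value to every hidden neuron.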

ResNet

The network is composed of three residual blocks followed by a GAP layer and a final softmax classifier whose number of neurons is equal to the number of classes in a dataset. Each residual block is composed of three convolutions whose output is added to the residual block’s input and then fed to the next layer. The number of filters for all convolutions is fixed to 64, with the ReLU activation function that is preceded by a batch normalization operation. In each residual block, the filter lengths of the first, second and third convolution are set to 8, 5 and 3, respectively.
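
The skip connection can be shown with a toy residual block. This sketch is simplified on purpose: it works on a single channel, uses hand-picked kernels instead of 64 learned filters, and leaves out batch normalization. It only demonstrates that 'same' padding keeps the length, so the block's input can be added to its output.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D convolution with zero padding so len(out) == len(x)."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(x) + [0.0] * (k - 1 - left)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, kernels):
    """Apply the convolutions in sequence, then add the block's input back."""
    out = list(x)
    for kernel in kernels:
        out = relu(conv1d_same(out, kernel))
    return [a + b for a, b in zip(out, x)]  # the skip connection


x = [1.0, 2.0, 3.0, 4.0]
# Kernels of length 8, 5 and 3, all zeros: the block passes its input through.
passthrough = residual_block(x, [[0.0] * 8, [0.0] * 5, [0.0] * 3])
# One identity kernel of length 3: the output is the input added to itself.
doubled = residual_block(x, [[0.0, 1.0, 0.0]])
```

In a multi-channel network the addition also requires matching channel counts, which is why a projection on the shortcut is commonly used when the input has a different number of channels than the convolutions.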

Multi-scale Convolutional Neural Network (MCNN)

MCNN’s architecture is very similar to a traditional CNN model, with two convolutions (each followed by max pooling), an FC layer and a final softmax layer. What makes the approach complex is its heavy data pre-processing step. Cui et al. (2016) were the first to introduce the Window Slicing (WS) method as a data augmentation technique. WS slides a window over the input time series and extracts subsequences, so the network is trained on the extracted subsequences instead of the raw input time series.
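
Window Slicing itself is simple to sketch. The window length of 54 (90% of the 60 time steps) and the stride of 1 are assumed values for illustration, not settings taken from this experiment.

```python
def window_slices(series, window, stride=1):
    """Return every subsequence of length `window`, moving `stride` steps at a time.

    `series` is a list of time steps. Each step may be a single value or a
    vector of counter values.
    """
    last_start = len(series) - window
    return [series[s:s + window] for s in range(0, last_start + 1, stride)]


# A toy series of 60 time steps, sliced with an assumed window of 54.
series = list(range(60))
slices = window_slices(series, window=54)
```

Each slice inherits the label of the series it was cut from, so one 60-step sample becomes 7 training subsequences in this setting.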

Multi Channel Deep Convolutional Neural Network (MCDCNN)

The architecture is mainly a traditional deep CNN with one modification for multivariate time series (MTS) data: the convolutions are applied independently (in parallel) on each dimension (or channel) of the input MTS. Each dimension of an input MTS goes through two convolutional stages with 8 filters of length 5 and ReLU as the activation function. Each convolution is followed by a max pooling operation of length 2. The outputs of the second convolutional stage for all dimensions are concatenated over the channels axis and then fed to an FC layer with 732 neurons and ReLU as the activation function. Finally, the softmax classifier is used with a number of neurons equal to the number of classes in the dataset.
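
The per-channel data flow can be sketched with numpy. Here `toy_branch` is a placeholder that applies only a max pooling of length 2. It stands in for the two convolution and pooling stages, which are not implemented in this sketch, and the batch of 4 random samples is made up for illustration.

```python
import numpy as np


def split_channels(x):
    """(N, T, C) -> list of C arrays shaped (N, T, 1), one per counter."""
    return [x[:, :, c:c + 1] for c in range(x.shape[2])]


def toy_branch(x_c):
    """Placeholder branch: non-overlapping max pooling of length 2 over time."""
    n, t, _ = x_c.shape
    trimmed = x_c[:, : t - t % 2]  # drop a trailing step if t is odd
    return trimmed.reshape(n, t // 2, 2, 1).max(axis=2)


X = np.random.default_rng(0).random((4, 60, 23))    # 4 samples of 60 x 23
channels = split_channels(X)                        # 23 inputs of (4, 60, 1)
branches = [toy_branch(x_c) for x_c in channels]    # each branch sees one counter
merged = np.concatenate(branches, axis=-1)          # joined on the channels axis
flat = merged.reshape(len(X), -1)                   # input to the FC layer
```

The point of the split is that no convolution mixes information between counters. Mixing only happens in the FC layer after the concatenation.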

Time Convolutional Neural Network (Time-CNN)

The first characteristic of Time-CNN is the use of the mean squared error (MSE) instead of the traditional categorical cross-entropy loss function. The network is composed of two consecutive convolutional layers with respectively 6 and 12 filters followed by a local average pooling operation of length 3. The convolutions adopt the sigmoid as the activation function. The network’s output consists of an FC layer with a number of neurons equal to the number of classes in the dataset.
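
The difference between the two loss functions can be seen on a single prediction. The sketch below evaluates both on a hypothetical two-class output of [0.8, 0.2] against a one-hot target. The values are made up for illustration.

```python
import math


def mean_squared_error(pred, target):
    """Average squared difference between predicted and target values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def categorical_cross_entropy(pred, target, eps=1e-12):
    """Negative log-likelihood of the true class (eps avoids log(0))."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


pred = [0.8, 0.2]      # hypothetical network output
target = [1.0, 0.0]    # one-hot label

mse = mean_squared_error(pred, target)
cce = categorical_cross_entropy(pred, target)
```

MSE penalizes the error on every output neuron, while cross-entropy with a one-hot target only looks at the probability assigned to the true class.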

Time Le-Net

This model can be considered a traditional CNN with two convolutions followed by an FC layer and a final softmax classifier. It differs from the FCN in two main ways: (1) it has an FC layer and (2) it uses local max-pooling operations. Both convolutions use the ReLU activation function and a filter length of 5. The first convolution uses 5 filters and is followed by a max pooling of length 2. The second convolution uses 20 filters and is followed by a max pooling of length 4. The convolutional blocks are followed by a non-linear fully connected layer composed of 500 neurons, each using the ReLU activation function. Finally, a softmax classifier produces the output.
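
Because this model uses local pooling, the time axis shrinks as data moves through the network. The sketch below traces the length of a 60-step input through both stages. The padding mode is not stated above, so both common choices are shown, and pooling is assumed to use non-overlapping windows that drop any remainder.

```python
def conv_len(length, kernel, padding):
    """Output length of a stride-1 convolution."""
    if padding == "same":
        return length
    return length - kernel + 1  # "valid": no padding


def pool_len(length, pool):
    """Output length of non-overlapping pooling (remainder dropped)."""
    return length // pool


def lenet_flat_features(length=60, padding="same"):
    """Number of values fed to the FC layer: final length x 20 filters."""
    length = pool_len(conv_len(length, 5, padding), 2)  # conv (5 filters), pool 2
    length = pool_len(conv_len(length, 5, padding), 4)  # conv (20 filters), pool 4
    return length * 20


features_same = lenet_flat_features(padding="same")    # 60 -> 30 -> 7 steps
features_valid = lenet_flat_features(padding="valid")  # 56 -> 28 -> 24 -> 6 steps
```

Under these assumptions the FC layer receives 140 values with 'same' padding and 120 with 'valid' padding.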

Results

All tested classifiers achieved more than 95% accuracy, except the InceptionTime classifier.