In this experiment I tested whether malware can be detected using deep learning classifiers trained on the values of a process's performance counters. The experiment requires three components:
- A tool to extract performance counters' values
- Data Processing
- Multivariate time series classification models
Let's start with the performance counter extractor tool.
Performance Counter Extractor (perfextract)
The tool extracts performance counter values either by starting a new process or by attaching to a running one. All counters are sampled simultaneously for 30 seconds at 0.5-second intervals, which yields 60 samples per counter. The tool can also track child processes and extract their data. The experiment uses the 23 counters listed below:
Counter | Description |
---|---|
% Privileged Time | Shows the percentage of non-idle processor time spent executing code in privileged mode. |
Handle Count | Shows the total number of handles currently open by this process. This number is equal to the sum of the handles currently open by each thread in this process. |
IO Read Operations/sec | Shows the rate, in incidents per second, at which the process was issuing read I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Data Operations/sec | Shows the rate, in incidents per second, at which the process was issuing read and write I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Write Operations/sec | Shows the rate, in incidents per second, at which the process was issuing write I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Other Operations/sec | Shows the rate, in incidents per second, at which the process was issuing I/O operations that were neither read nor write operations (for example, a control function). It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Read Bytes/sec | Shows the rate, in bytes per second, at which the process was reading bytes from I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Write Bytes/sec | Shows the rate, in bytes per second, at which the process was writing bytes to I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Data Bytes/sec | Shows the rate, in bytes per second, at which the process was reading and writing bytes in I/O operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
IO Other Bytes/sec | Shows the rate, in bytes per second, at which the process was issuing bytes to I/O operations that do not involve data, such as control operations. It counts all I/O activity generated by the process including file, network, and device I/Os. |
Page Faults/sec | Shows the rate, in incidents per second, at which page faults were handled by the processor. A page fault occurs when a process requires code or data that is not in its working set. This counter includes both hard faults (those that require disk access) and soft faults (those where the faulted page is found elsewhere in physical memory). |
Page File Bytes Peak | Shows the maximum amount of virtual memory, in bytes, that a process has reserved for use in the paging file(s). Paging files are used to store pages of memory used by the process. |
Page File Bytes | Shows the current amount of virtual memory, in bytes, that a process has reserved for use in the paging file(s). Paging files are used to store pages of memory used by the process. |
Pool Paged Bytes | Shows the number of bytes in the paged pool, an area of system memory (physical memory used by the operating system) for objects that can be written to disk. |
Pool Nonpaged Bytes | Shows the number of bytes in the nonpaged pool, an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk. |
Private Bytes | Shows the size, in bytes, that this process has allocated that cannot be shared with other processes. |
Priority Base | Shows the current base priority of this process. Threads within a process can raise and lower their own base priority relative to the process's base priority. |
Thread Count | Shows the number of threads that were active in this process. |
Virtual Bytes Peak | Shows the maximum size, in bytes, of virtual address space that the process has used at any one time. |
Virtual Bytes | Shows the size, in bytes, of the virtual address space that the process is using. |
Working Set Peak | Shows the maximum size, in bytes, of the working set of this process. The working set is the set of memory pages that were touched recently by the threads in the process. |
Working Set | Shows the size, in bytes, of the working set of this process. |
Working Set - Private | Subset of working set that specifically describes the amount of memory a process is using that can't be shared by other processes. |
perfextract (-f <path to exe file> | -p <pid of specific running program>) [options] [-o <path to output folder>]
-c Track child processes
- The tool outputs a CSV file containing the extracted data.
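For scripted collection runs, the command line shown above can be assembled programmatically. The following is a minimal sketch: the helper name `build_perfextract_cmd` is hypothetical, and it assumes the executable is reachable as `perfextract` on the PATH. Only the flags listed in the usage text are used.

```python
from typing import List, Optional


def build_perfextract_cmd(
    exe_path: Optional[str] = None,
    pid: Optional[int] = None,
    track_children: bool = False,
    out_dir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for perfextract (hypothetical helper)."""
    # Exactly one of -f (start a new process) or -p (attach) must be given.
    if (exe_path is None) == (pid is None):
        raise ValueError("pass exactly one of exe_path or pid")
    cmd = ["perfextract"]
    cmd += ["-f", exe_path] if exe_path is not None else ["-p", str(pid)]
    if track_children:
        cmd.append("-c")  # also record counters of child processes
    if out_dir is not None:
        cmd += ["-o", out_dir]
    return cmd
```

The resulting list can be passed to `subprocess.run(..., check=True)` to start a capture.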
Data Processing (utils)
Data processing is done using `pandas`, `numpy` and `sklearn`.
First we loaded the CSV files as data frames using `pandas` and converted them to a 3-dimensional `numpy` array. Each sample has shape `60x23` (60 time steps, from 30 seconds sampled every 0.5 seconds, by 23 counters), so the dataset has shape `Nx60x23`, where N is the number of samples.
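The loading step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual loader: it assumes each CSV has a header row and exactly 23 numeric columns (no timestamp column), and the function names are hypothetical.

```python
from typing import Iterable

import numpy as np
import pandas as pd

N_STEPS = 60      # 30 s sampled every 0.5 s
N_COUNTERS = 23   # one column per performance counter


def frame_to_sample(frame: pd.DataFrame) -> np.ndarray:
    """Convert one capture (rows = time steps, columns = counters) to a 60x23 array."""
    sample = frame.to_numpy(dtype=np.float64)
    if sample.shape != (N_STEPS, N_COUNTERS):
        raise ValueError(f"expected {(N_STEPS, N_COUNTERS)}, got {sample.shape}")
    return sample


def load_dataset(csv_sources: Iterable) -> np.ndarray:
    """Read each CSV with pandas and stack the samples into an N x 60 x 23 array."""
    return np.stack([frame_to_sample(pd.read_csv(src)) for src in csv_sources])
```

`csv_sources` may hold file paths or open file-like objects, since `pd.read_csv` accepts both.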
After loading the dataset, we normalize the data using either min-max normalization or Z-score normalization from `sklearn`.
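To make the two normalization options concrete, here is a plain `numpy` sketch of what they compute on a 3-dimensional dataset. It is an assumption of this sketch that statistics are computed per counter, pooled over all samples and time steps; the small `eps` term is also an addition to avoid division by zero for constant counters.

```python
import numpy as np


def zscore_normalize(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score each counter using statistics pooled over samples and time steps."""
    mean = X.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, n_counters)
    std = X.std(axis=(0, 1), keepdims=True)
    return (X - mean) / (std + eps)             # eps guards constant counters


def minmax_normalize(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale each counter to roughly [0, 1]."""
    lo = X.min(axis=(0, 1), keepdims=True)
    hi = X.max(axis=(0, 1), keepdims=True)
    return (X - lo) / (hi - lo + eps)
```

With `sklearn`, essentially the same per-counter scaling results from reshaping the array to `(N*60, 23)`, applying `StandardScaler` or `MinMaxScaler`, and reshaping back.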
The last step is splitting the dataset into training, validation and test sets, for which we also used `sklearn`.
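The mechanics of a three-way split can be sketched in `numpy` as below. The split ratios are not stated in this document, so the 70/15/15 default here is purely an assumption, and the function name is hypothetical. In practice, calling `sklearn`'s `train_test_split` twice does the same job and can additionally stratify by label.

```python
import numpy as np


def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle sample indices once, then cut them into three disjoint parts."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )
```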
Classification Models (classifiers)
Fully Convolutional Networks (FCNs) are mainly convolutional networks that do not contain any local pooling layers, which means that the length of a time series is kept unchanged throughout the convolutions. In addition, one of the main characteristics of this architecture is the replacement of the traditional final fully connected (FC) layer with a Global Average Pooling (GAP) layer, which drastically reduces the number of parameters in the network while enabling the use of Class Activation Maps (CAM), which highlight the parts of the input time series that contributed most to a given classification.
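The parameter saving from GAP can be checked with a little arithmetic. The sketch below assumes 128 feature maps in the last convolution and 2 output classes; neither number is given in this document, so treat both as illustrative.

```python
import numpy as np


def global_average_pooling(features: np.ndarray) -> np.ndarray:
    """Average each feature map over time: (time, channels) -> (channels,)."""
    return features.mean(axis=0)


def classifier_weights(n_inputs: int, n_classes: int) -> int:
    """Number of weights plus biases in a dense softmax layer."""
    return n_inputs * n_classes + n_classes


T, C, K = 60, 128, 2                        # C and K are illustrative assumptions
features = np.ones((T, C))                  # length stays 60: no local pooling
pooled = global_average_pooling(features)   # shape (128,)

flatten_params = classifier_weights(T * C, K)   # dense layer on flattened maps
gap_params = classifier_weights(C, K)           # dense layer on the GAP output
```

Under these assumptions the classifier on flattened feature maps needs 15,362 parameters, while the one on the GAP output needs 258.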
Encoder is a hybrid deep CNN whose architecture is inspired by FCN with a main difference where the GAP layer is replaced with an attention layer. Similarly to FCN, the first three layers are convolutional with some relatively small modifications. The first convolution is composed of 128 filters of length 5; the second convolution is composed of 256 filters of length 11; the third convolution is composed of 512 filters of length 21. Each convolution is followed by an instance normalization operation whose output is fed to the PReLU activation function. The output of PReLU is followed by a dropout operation and a final max pooling of length 2. The third convolutional layer is fed to an attention mechanism that enables the network to learn which parts of the time series are important for a certain classification. Finally, a traditional softmax classifier is fully connected to the latter layer with a number of neurons equal to the number of classes in the dataset.
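The idea of attention over time steps can be illustrated with a softmax-weighted sum. This is a generic, simplified sketch rather than the Encoder's exact formulation, which this document does not spell out; the function names and the way the relevance scores are supplied are assumptions.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Weight time steps by softmax(scores) and sum them.

    features: (time, channels) values.
    scores:   (time, channels) unnormalized relevance, learned in a real network.
    """
    weights = softmax(scores, axis=0)          # each channel's weights sum to 1 over time
    return (weights * features).sum(axis=0)    # shape (channels,)
```

With uniform scores this reduces to plain averaging over time; a time step with a much larger score dominates the pooled output.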
The MLP (multilayer perceptron) contains 4 layers in total, where each one is fully connected to the output of its previous layer. The final layer is a softmax classifier, which is fully connected to its previous layer’s output and contains a number of neurons equal to the number of classes in a dataset. All three hidden FC layers are composed of 500 neurons with ReLU as the activation function. Each layer is preceded by a dropout operation.
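The four-layer fully connected network described above can be sketched as an inference-time forward pass in `numpy`. The weights here are random placeholders, the two-class output is an assumption, and dropout is left out because it is inactive at inference.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def init_mlp(n_inputs, n_classes, n_hidden=500, seed=0):
    """Random weights for three hidden layers of 500 units plus a softmax layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, n_hidden, n_hidden, n_hidden, n_classes]
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b)) for a, b in zip(sizes, sizes[1:])]


def mlp_forward(sample, params):
    """Inference-time forward pass; dropout is omitted."""
    h = sample.reshape(-1)                 # flatten 60x23 -> 1380
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)


params = init_mlp(60 * 23, n_classes=2)    # two classes is an assumption
n_params = sum(W.size + b.size for W, b in params)
```

For a flattened 60x23 input and two classes, this layout has 1,192,502 trainable parameters.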
ResNet is composed of three residual blocks followed by a GAP layer and a final softmax classifier whose number of neurons is equal to the number of classes in a dataset. Each residual block is first composed of three convolutions whose output is added to the residual block’s input and then fed to the next layer. The number of filters for all convolutions is fixed to 64, with the ReLU activation function that is preceded by a batch normalization operation. In each residual block, the filter’s length is set to 8, 5 and 3.
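A single residual block of the kind described above can be sketched in `numpy`. Batch normalization is omitted to keep the example short, the weights are random placeholders, and the length-1 convolution on the shortcut (used so the 23-channel input can be added to the 64-channel output) is an assumption of this sketch.

```python
import numpy as np


def conv1d_same(x, kernel):
    """Cross-correlate x (time, c_in) with kernel (k, c_in, c_out), keeping the length."""
    k = kernel.shape[0]
    left = (k - 1) // 2
    right = k - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))
    return np.stack([
        np.tensordot(padded[t:t + k], kernel, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ])


def residual_block(x, kernels, shortcut):
    """Three convolutions (lengths 8, 5, 3) plus a shortcut connection."""
    h = x
    for i, kernel in enumerate(kernels):
        h = conv1d_same(h, kernel)
        if i < len(kernels) - 1:
            h = np.maximum(h, 0.0)           # ReLU between convolutions
    skip = conv1d_same(x, shortcut)          # length-1 conv to match channel counts
    return np.maximum(h + skip, 0.0)         # add the block input, then ReLU


rng = np.random.default_rng(0)
x = rng.normal(size=(60, 23))
kernels = [rng.normal(0, 0.1, (k, c_in, 64)) for k, c_in in [(8, 23), (5, 64), (3, 64)]]
shortcut = rng.normal(0, 0.1, (1, 23, 64))
out = residual_block(x, kernels, shortcut)   # shape (60, 64)
```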
MCNN’s architecture is very similar to a traditional CNN model, with two convolutions (and max pooling) followed by an FC layer and a final softmax layer. On the other hand, this approach is very complex because of its heavy data pre-processing step. Cui et al. (2016) were the first to introduce the Window Slicing (WS) method as a data augmentation technique. WS slides a window over the input time series and extracts subsequences, so that the network is trained on the extracted subsequences instead of the raw input time series.
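Window Slicing itself is simple to sketch. The function name, the window length and the stride below are illustrative choices, not values taken from this document.

```python
import numpy as np


def window_slices(series: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Extract overlapping subsequences: (time, channels) -> (n_windows, window, channels)."""
    n_steps = series.shape[0]
    if window > n_steps:
        raise ValueError("window is longer than the series")
    starts = range(0, n_steps - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])
```

For a 60-step sample and a window of 54 steps with stride 1, this yields 7 subsequences, each of which inherits the label of the original sample.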
MCDCNN (Multi-Channel Deep CNN) is mainly a traditional deep CNN with one modification for multivariate time series (MTS) data: the convolutions are applied independently (in parallel) on each dimension (or channel) of the input MTS. Each dimension of an input MTS goes through two convolutional stages with 8 filters of length 5 and ReLU as the activation function. Each convolution is followed by a max pooling operation of length 2. The output of the second convolutional stage for all dimensions is concatenated over the channels axis and then fed to an FC layer with 732 neurons with ReLU as the activation function. Finally, the softmax classifier is used with a number of neurons equal to the number of classes in the dataset.
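The per-dimension convolution described above can be illustrated with a sketch of the first stage only. It assumes zero padding that keeps the length unchanged before pooling, which this document does not specify, and it uses random placeholder filters in the test.

```python
import numpy as np


def per_channel_stage(x: np.ndarray, filters: np.ndarray, pool: int = 2) -> np.ndarray:
    """One convolution stage where each input dimension gets its own filters.

    x: (time, dims); filters: (dims, n_filters, length).
    Returns (time // pool, dims, n_filters).
    """
    n_steps, n_dims = x.shape
    n_filters = filters.shape[1]
    out = np.empty((n_steps, n_dims, n_filters))
    for d in range(n_dims):                       # no mixing across dimensions
        for f in range(n_filters):
            # Reverse the kernel so np.convolve computes a cross-correlation.
            out[:, d, f] = np.convolve(x[:, d], filters[d, f, ::-1], mode="same")
    out = np.maximum(out, 0.0)                    # ReLU
    trimmed = out[: (n_steps // pool) * pool]     # drop a trailing remainder, if any
    return trimmed.reshape(n_steps // pool, pool, n_dims, n_filters).max(axis=1)
```

Because each counter is filtered separately, changing one counter's series only changes that counter's slice of the output.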
The first characteristic of Time-CNN is the use of the mean squared error (MSE) instead of the traditional categorical cross-entropy loss function. The network is composed of two consecutive convolutional layers with respectively 6 and 12 filters followed by a local average pooling operation of length 3. The convolutions adopt the sigmoid as the activation function. The network’s output consists of an FC layer with a number of neurons equal to the number of classes in the dataset.
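The difference between the two loss functions can be seen on a toy example. The label and prediction values below are illustrative only.

```python
import numpy as np


def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between one-hot targets and network outputs."""
    return float(np.mean((y_true - y_pred) ** 2))


def cross_entropy_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Categorical cross-entropy, shown only for comparison."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1)))


y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels (illustrative values)
y_pred = np.array([[0.5, 0.5], [0.0, 1.0]])   # example network outputs

mse = mse_loss(y_true, y_pred)
ce = cross_entropy_loss(y_true, y_pred)
```

MSE penalizes the squared distance on every output unit, whereas cross-entropy only looks at the probability assigned to the true class.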
t-LeNet can be considered a traditional CNN with two convolutions followed by an FC layer and a final softmax classifier. There are two main differences with the FCNs: (1) an FC layer and (2) local max-pooling operations. For both convolutions, the ReLU activation function is used with a filter length equal to 5. For the first convolution, 5 filters are used and followed by a max pooling of length equal to 2. The second convolution uses 20 filters followed by a max pooling of length equal to 4. The convolutional blocks are followed by a non-linear fully connected layer which is composed of 500 neurons, each one using the ReLU activation function. Finally, a softmax classifier produces the output.
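Unlike FCN, the local pooling in the two-convolution network above shrinks the series before it reaches the FC layer. The sketch below traces the sequence length for a 60-step input; the padding mode is not stated in this document, so both common options are shown, and the function names are hypothetical.

```python
def conv_out_len(n: int, kernel: int, padding: str = "same") -> int:
    """Output length of a stride-1 1D convolution."""
    return n if padding == "same" else n - kernel + 1


def pool_out_len(n: int, pool: int) -> int:
    """Output length of non-overlapping pooling (remainder dropped)."""
    return n // pool


def flat_features(n_steps: int, padding: str = "same") -> int:
    """Number of values reaching the 500-unit FC layer."""
    n = pool_out_len(conv_out_len(n_steps, 5, padding), 2)   # conv 1: 5 filters, pool 2
    n = pool_out_len(conv_out_len(n, 5, padding), 4)         # conv 2: 20 filters, pool 4
    return n * 20                                            # 20 feature maps remain
```

With "same" padding the lengths go 60, 30, 7, giving 140 flattened features; with "valid" padding they go 56, 28, 24, 6, giving 120.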
All tested classifiers achieved more than 95% accuracy, except the InceptionTime classifier.