SDAV Practical Worksheet 2


The completion and sign-off of this worksheet is worth 5% towards your final mark for this module.

Task


You have been asked to examine a set of workstations, some of which have been found to be compromised by a malware infection. For each workstation, you have a record of its process and memory usage, and a label stating whether an infection has been found or whether the workstation is considered to be normal.

Your task is to develop a machine learning classifier that can distinguish between infected and normal workstations. The data has already been split to provide a training sample and a testing sample. You should train your classifier on the training sample, and then test the accuracy of the system using the testing sample. You should also use an appropriate visualisation for representing the classifier results. You will need to provide brief captions in your notebook that describe the process that you have used, how well the classifier performs, and how the visualisation helps to understand the performance of the classifier.
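
As a rough illustration of the train-then-test workflow described above, here is a minimal sketch using a k-nearest-neighbours classifier written in plain NumPy. The choice of k-nearest-neighbours, the helper name `knn_predict`, and the synthetic two-cluster data are all assumptions made for illustration; the worksheet does not prescribe a classifier, and the synthetic points are not the worksheet data.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5):
    """Label each test row by majority vote among its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    predictions = []
    for row in test_X:
        # Euclidean distance from this test point to every training point
        distances = np.sqrt(((train_X - row) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        # Majority vote among the k nearest neighbours
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Synthetic stand-in for the (proc, mem) features: NOT the worksheet data
rng = np.random.default_rng(0)
normal = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(80, 2))
infected = rng.normal(loc=[1.5, 1.5], scale=0.5, size=(20, 2))
X = np.vstack([normal, infected])
y = np.array(['Normal'] * 80 + ['Infected'] * 20)

# Shuffle, then hold back the last 30 rows for testing
order = rng.permutation(len(X))
X, y = X[order], y[order]
train_X, train_y = X[:70], y[:70]
test_X, test_y = X[70:], y[70:]

# Train on the training sample, then measure accuracy on the unseen testing sample
predicted = knn_predict(train_X, train_y, test_X, k=5)
accuracy = (predicted == test_y).mean()
print(f"Test accuracy: {accuracy:.2f}")
```

Accuracy here is the fraction of test rows whose predicted label matches the true label. The same pattern should carry over to the real data once the `proc` and `mem` columns and the `state` labels are pulled out of the dataframes.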

Assessment and Marking


Marks will be awarded for the following:

  • 2 marks for implementing a suitable classifier
  • 1 mark for assessing the classifier accuracy using the test data
  • 1 mark for using a suitable visualisation to convey the classifier results
  • 1 mark for providing suitable commentary in the form of markdown cells that describe the process
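
One possible way to visualise classifier results is a confusion matrix, which shows how many workstations of each true state were assigned to each predicted state. The sketch below is only an illustration: the helper name `confusion_counts` and the six hard-coded labels are hypothetical, and other visualisations (for example, a scatter plot of `proc` against `mem` coloured by prediction) may suit the task equally well.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # lets the sketch run without a display; drop this line inside a notebook
import matplotlib.pyplot as plt

def confusion_counts(true_labels, predicted_labels, classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t], index[p]] += 1
    return counts

classes = ["Normal", "Infected"]
# Hypothetical labels for illustration only (not results from the worksheet data)
true_labels = ["Normal", "Normal", "Infected", "Normal", "Infected", "Normal"]
predicted_labels = ["Normal", "Infected", "Infected", "Normal", "Normal", "Normal"]

counts = confusion_counts(true_labels, predicted_labels, classes)

# Draw the matrix as a heatmap and write each count inside its cell
fig, ax = plt.subplots()
ax.imshow(counts, cmap="Blues")
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes)
ax.set_yticks(range(len(classes)))
ax.set_yticklabels(classes)
ax.set_xlabel("Predicted state")
ax.set_ylabel("True state")
for i in range(len(classes)):
    for j in range(len(classes)):
        ax.text(j, i, str(counts[i, j]), ha="center", va="center")
```

The diagonal cells hold the correct predictions, and the off-diagonal cells separate false alarms (Normal predicted as Infected) from missed infections (Infected predicted as Normal), a distinction that a single accuracy figure hides.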

Submission


You will need to demonstrate your final solution in notebook format to the module leader during the practical workshop sessions. Once this has been signed as complete by the module leader, please save your notebook as an 'HTML' file, showing all cell output, and e-mail the HTML file to the module leader (Phil.Legg@uwe.ac.uk), with an e-mail subject line: 'SDAV-WORKSHEET2'.

In [1]:
### Here are the imports that you will require
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request

def load_data():
    # the data is a record of process and memory usage per workstation (saved as a csv file)
    file_name = 'memproc.csv'
    url = "http://plegg.me.uk/teaching/sdav/workshops/data/" + file_name
    # this will download the data for us from the URL and save it locally
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        data = response.read()
        out_file.write(data)
    # this will then put the csv data into a pandas dataframe
    data = pd.read_csv(file_name)
    # split the data into training and testing sets
    split_value = 200
    training = data.iloc[:split_value]
    testing = data.iloc[split_value:]
    # the training set is returned whole, including the 'state' label column
    train_data = training
    #train_data = training[['host', 'proc', 'mem']]
    #train_labels = training[['state']]
    test_data = testing[['host', 'proc', 'mem']]
    test_labels = testing[['state']]
    return train_data, test_data, test_labels
In [2]:
# Here are the training and testing datasets
# train_data still contains the 'state' label column; the testing set is
# split into the input data (test_data) and the output labels (test_labels)
train_data, test_data, test_labels = load_data()
train_data
Out[2]:
host proc mem state
0 crisnd6378 -1.735788 -0.722979 Normal
1 crisnd5885 -0.568770 -1.934926 Normal
2 crisnd4508 -1.102691 -2.629311 Normal
3 crisnd6376 -2.010346 -1.778285 Normal
4 crisnd1301 -0.683525 -0.396034 Normal
5 crisnd7419 0.351669 3.218415 Infected
6 crisnd0357 -1.057801 -2.199631 Normal
7 crisnd2342 1.896578 -1.901249 Normal
8 crisnd0241 -2.294327 0.354716 Normal
9 crisnd9665 -2.156447 0.465791 Normal
10 crisnd5669 -2.247110 -1.235257 Normal
11 crisnd0426 -0.532614 0.631420 Normal
12 crisnd7183 0.948436 0.049716 Normal
13 crisnd2969 -1.873865 -1.038698 Normal
14 crisnd0194 -0.802533 -1.308481 Normal
15 crisnd9368 -1.290675 -0.437498 Normal
16 crisnd7336 0.470141 -0.546420 Normal
17 crisnd8975 -0.651143 -2.132100 Normal
18 crisnd0062 -0.351499 -0.441091 Normal
19 crisnd9002 -1.018804 -0.345687 Normal
20 crisnd7611 -0.227872 0.965092 Infected
21 crisnd4829 -0.397209 -1.606782 Normal
22 crisnd8649 -0.126401 -0.252286 Normal
23 crisnd2051 0.908639 -0.347788 Normal
24 crisnd3136 -0.929653 -0.893374 Normal
25 crisnd8444 0.540876 -1.648172 Normal
26 crisnd6107 -0.068564 -1.495283 Normal
27 crisnd2116 0.265255 -1.209587 Normal
28 crisnd3937 -0.650280 -1.510971 Normal
29 crisnd7775 -0.466349 2.177208 Infected
... ... ... ... ...
170 crisnd1262 -0.322482 -0.569976 Normal
171 crisnd4154 -0.295325 0.195858 Normal
172 crisnd2556 -0.508676 -1.509090 Normal
173 crisnd7606 -0.269611 -0.700383 Normal
174 crisnd2485 -0.703566 -0.802375 Normal
175 crisnd5284 -1.082686 -2.607552 Normal
176 crisnd4262 -1.374604 -0.705112 Normal
177 crisnd6260 -1.092681 -0.283539 Normal
178 crisnd3041 0.199040 -1.841634 Normal
179 crisnd9959 0.267240 -0.461053 Normal
180 crisnd4908 1.667540 -0.906340 Infected
181 crisnd5853 -1.438653 -0.043687 Normal
182 crisnd6669 0.944003 -1.399310 Normal
183 crisnd6827 -0.701091 -0.919277 Normal
184 crisnd1782 0.721711 -0.929448 Infected
185 crisnd0336 -0.929540 -0.213725 Normal
186 crisnd5017 -0.915000 0.023049 Normal
187 crisnd7055 -1.519754 -2.735674 Normal
188 crisnd1035 -0.831548 -1.945557 Normal
189 crisnd2312 -0.310349 -0.565440 Normal
190 crisnd4528 -0.891375 -0.297115 Normal
191 crisnd1664 2.653077 2.077163 Infected
192 crisnd9076 -1.309366 -1.335611 Normal
193 crisnd1950 -1.316272 -1.129297 Normal
194 crisnd1002 -0.709283 -2.145937 Normal
195 crisnd0548 0.734096 1.422292 Infected
196 crisnd8256 -1.939244 -2.206365 Normal
197 crisnd8738 -1.457856 0.047818 Normal
198 crisnd5780 -1.058828 -2.274430 Normal
199 crisnd0529 -2.490981 1.484893 Normal

200 rows × 4 columns
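
Because `train_data` comes back with the `state` column still attached, a classifier will typically need the inputs and the labels separated first. The sketch below shows one way to do that with pandas, using four rows copied from the output above as a stand-in for the full dataframe. Treating `host` as an identifier and leaving it out of the features is an assumption, and the names `train_features` and `train_labels` are chosen here for illustration.

```python
import pandas as pd

# Four rows copied from the output above, standing in for the full train_data
train_data = pd.DataFrame({
    'host': ['crisnd6378', 'crisnd5885', 'crisnd7419', 'crisnd7611'],
    'proc': [-1.735788, -0.568770, 0.351669, -0.227872],
    'mem': [-0.722979, -1.934926, 3.218415, 0.965092],
    'state': ['Normal', 'Normal', 'Infected', 'Infected'],
})

# Numeric inputs only; 'host' is assumed to be an identifier, not a measurement
train_features = train_data[['proc', 'mem']]
# Output labels as a single column (a pandas Series)
train_labels = train_data['state']

# Check how many examples of each class are available for training
print(train_labels.value_counts())
```

Checking the class counts is worthwhile on the real data: in the rows shown above, Normal workstations greatly outnumber Infected ones, so a classifier that always predicts Normal could still score a high accuracy.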
