UFCFFY-15-M Cyber Security Analytics

Assignment: Task 2


Portfolio Task 2 (conduct an investigation on a URL database to develop a DGA classification system using machine learning techniques) is worth 20% of your portfolio for the UFCFFY-15-M Cyber Security Analytics (CSA) module. Please refer to your Assignment Overview for full details.

Portfolio Task 2: Conduct an investigation on a URL database to develop a DGA classification system using machine learning techniques (20%)


For this task, you will be provided with a URL dataset. You will need to develop a machine learning tool using Python and scikit-learn that can identify URLs produced by Domain Generation Algorithms (DGAs), which are widely used by command and control malware to avoid static IP blocking. You are expected to show how a suitable set of features can be derived from the data for developing a machine learning classifier using Python data science libraries. You should also compare the results of 3 different classifiers for your task using the scikit-learn library, and provide a confusion matrix and an accuracy score for each classifier. Your portfolio submission for this task should be an HTML export of your IPYNB Jupyter notebook that details your investigation, using appropriate code cells to perform the required analysis and Markdown cells to explain your work. You should include the following 3 classifiers in your study:

  • clf = LogisticRegression(random_state=42)
  • clf = RandomForestClassifier(max_depth=100, random_state=42)
  • clf = MLPClassifier(random_state=42, max_iter=300)
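As a rough illustration of the comparison described above, the sketch below trains the three classifiers on a tiny toy sample and records an accuracy score and a confusion matrix for each. It is only a sketch under stated assumptions: the toy domain list, the binary benign/dga labels, the character n-gram features, and the names `classifiers` and `results` are illustrative choices and are not part of the brief. The scores it produces on such a small sample are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy sample (assumption): a few well-known domains plus random-looking ones.
# The last three DGA-style strings are made up for illustration only.
benign = ["google.com", "facebook.com", "youtube.com", "twitter.com",
          "instagram.com", "wikipedia.org", "amazon.com", "linkedin.com"]
dga = ["fhyibfwhpahb.su", "nlgusntqeqixnqyo.org", "awwduqqrjxttmn.su",
       "ccxmwif.pl", "yhrryqjimvgfbqrv.pw", "qwkzjtxvbn.ru",
       "xkvqpzlwmtr.net", "zqjxhvkwpy.info"]
domains = benign + dga
labels = ["benign"] * len(benign) + ["dga"] * len(dga)

# Split the raw strings first so the vectorizer only sees training data.
d_train, d_test, y_train, y_test = train_test_split(
    domains, labels, test_size=0.25, stratify=labels, random_state=42)

# Character unigrams and bigrams as a simple feature representation.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(d_train)
X_test = vectorizer.transform(d_test)

# The three classifiers named in the brief, with the stated parameters.
classifiers = {
    "LogisticRegression": LogisticRegression(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(max_depth=100, random_state=42),
    "MLPClassifier": MLPClassifier(random_state=42, max_iter=300),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=["benign", "dga"]),
    }

for name, scores in results.items():
    print(name, scores["accuracy"])
    print(scores["confusion_matrix"])
```

Passing `labels=` to `confusion_matrix` fixes the row and column order, which makes the matrices directly comparable across classifiers.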

Dataset: Please see the folder "Portfolio Assignment" under the Assignment tab on Blackboard for details on how to access and download the dataset.

Hint: You should conduct research using the scikit-learn documentation and API reference based on the sample code provided. You should also think about a suitable means of generating input features for your classifier that capture sequential properties of text data.
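One possible way to capture sequential properties of text is to count character n-grams, which `CountVectorizer` (already imported in the starter code) supports through its `analyzer` and `ngram_range` parameters. The sketch below is illustrative only: the choice of bigrams and the two example domains are assumptions, and other representations may suit the task better.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two example domains taken from the dataset preview.
domains = ["google.com", "ccxmwif.pl"]

# Count overlapping character bigrams, e.g. "go", "oo", "og", ...
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(domains)

# vocabulary_ maps each bigram to its column index in the sparse matrix X.
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Widening `ngram_range` (for example to `(1, 3)`) captures longer character sequences at the cost of a larger feature matrix.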

Assessment and Marking


Each criterion is marked against the following grade bands: 0-39, 40-49, 50-59, 60-69, 70-84, 85-100.

Suitable data pre-processing stages (25%)
  • 0-39: No evidence of progress
  • 40-49: A limited attempt to address this criterion
  • 50-59: Some fair attempt at data pre-processing but perhaps some flaws
  • 60-69: Good approach to the problem with minor flaws
  • 70-84: Very good approach to the problem
  • 85-100: Excellent approach to the problem

Suitable usage of machine learning library (25%)
  • 0-39: No evidence of progress
  • 40-49: A limited attempt to address this criterion
  • 50-59: A partial working solution but perhaps some flaws
  • 60-69: Good approach to the problem with minor flaws
  • 70-84: Very good approach to the problem with good demonstration of understanding for the ML parameters under investigation
  • 85-100: Excellent approach to the problem with very good demonstration of understanding for the ML parameters under investigation

Suitable experimental design and analysis of results (25%)
  • 0-39: No evidence of progress
  • 40-49: A limited attempt to address this criterion
  • 50-59: Single ML experimentation and/or analysis by only accuracy
  • 60-69: Good experimental approach, comparison of multiple methods/multiple feature representations, and/or analysis by only accuracy
  • 70-84: Deeper experimental approach with critique of multiple methods and feature sets, with extended analysis of results
  • 85-100: Deeper experimental approach with critique of multiple methods and feature sets with extended analysis of results, to a highly professional standard

Clarity and professional report presentation (25%)
  • 0-39: No evidence of progress
  • 40-49: A limited attempt to address this criterion
  • 50-59: Some evidence of markdown commentary but with major flaws
  • 60-69: Markdown commentary with only minor flaws
  • 70-84: Very good detail in markdown commentary
  • 85-100: Excellent detail in markdown commentary

To achieve the higher end of the grade scale, you need to demonstrate creativity in how you pre-process the data, how you utilise and assess multiple ML approaches, and then discuss the outcomes of your learning system.

Submission Documents


Your submission for this task should include:

  • 1 Jupyter Notebook exported in HTML format. You should complete your work using the IPYNB file provided (i.e., this document). You should also complete the self-assessment section (below). Once you have completed your work, you should use the export function in Jupyter to save your notebook as an HTML document. Do not submit an ipynb file - we will not execute any code during marking. Therefore, you must ensure that all cell output is clear in your HTML document for your marker.

Your final portfolio should be submitted to Blackboard by 14:00 on 12th May 2022. Your Blackboard submission should consist of the following individual files:

  • Task1.html (an HTML document exported from Jupyter notebook for Task 1)
  • Task1.ipynb (source Jupyter notebook for Task 1)
  • Task2.html (an HTML document exported from Jupyter notebook for Task 2)
  • Task2.ipynb (source Jupyter notebook for Task 2)
  • Task3.pdf (a PDF report of your research investigation for Task 3)
  • Task4.mp4 (an MP4 video file, or similar standard format - or a URL to an online video - for Task 4)

Please do not ZIP the files together as a single submission on Blackboard; you can submit multiple files to Blackboard.

Self-Assessment


For each criterion, please reflect on the marking rubric and indicate what grade you would expect to receive for the work that you are submitting. For your own personal development and learning, it is important to reflect on your work and to attempt to assess it carefully. Do think carefully about both positive aspects of your work, as well as any limitations you may have faced.

  • Suitable data pre-processing stages (25%): You estimate that your grade will be __.

  • Suitable usage of machine learning library (25%): You estimate that your grade will be __.

  • Suitable experimental design and analysis of results (25%): You estimate that your grade will be __.

  • Clarity and professional report presentation (25%): You estimate that your grade will be __.

Please provide a minimum of two sentences to comment and reflect on your own self-assessment: __. __.

Contact


Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can also ask questions about this module and this assignment through the Blackboard Q&A feature or during the on-site teaching sessions.


In [7]:
# Import libraries as required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 10)

from collections import Counter
from timeit import timeit
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
In [5]:
# Load in the data set as required
df = pd.read_csv('./task2-dga/dga-24000.csv')
df
Out[5]:
Domain Family
0 google.com benign
1 facebook.com benign
2 youtube.com benign
3 twitter.com benign
4 instagram.com benign
... ... ...
23995 fhyibfwhpahb.su locky
23996 nlgusntqeqixnqyo.org locky
23997 awwduqqrjxttmn.su locky
23998 ccxmwif.pl locky
23999 yhrryqjimvgfbqrv.pw locky

24000 rows × 2 columns

In [6]:
# Count how many entries exist for each malware family (plus the benign class)
df.value_counts('Family')
Out[6]:
Family
banjori     1000
benign      1000
tinba       1000
symmi       1000
suppobox    1000
            ... 
locky       1000
gameover    1000
flubot      1000
emotet      1000
virut       1000
Length: 24, dtype: int64
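
The counts above show 24 equally sized classes, so a stratified train/test split keeps that balance in both partitions. The sketch below uses a small stand-in DataFrame so that it runs on its own. The `toy` frame, the `IsDGA` column, and the 80/20 split are assumptions for illustration. Whether to treat the task as binary (benign vs. DGA) or multi-class (per family) is a design decision for your investigation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real DataFrame: 3 families x 10 rows each.
toy = pd.DataFrame({
    "Domain": [f"example{i}.com" for i in range(30)],
    "Family": ["benign"] * 10 + ["locky"] * 10 + ["tinba"] * 10,
})

# Optional binary target (assumption): 0 for benign, 1 for any DGA family.
toy["IsDGA"] = (toy["Family"] != "benign").astype(int)

# Stratify on Family so every class keeps the same share in both partitions.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["Family"], random_state=42)

print(train_df["Family"].value_counts())
print(test_df["Family"].value_counts())
```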

Start your investigation...

Carry on with the investigation based on the initial code provided above. Conclude your investigation with a summary of your findings.
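
As one possible starting point, hand-crafted features can be computed per domain and compared with n-gram features. The sketch below is an assumption-laden illustration and not a prescribed method: the helper names `shannon_entropy` and `domain_features`, and the choice of length, character entropy, and digit ratio, are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy (in bits) of the characters in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain):
    """Compute simple illustrative features from the label before the first dot."""
    name = domain.split(".", 1)[0]
    digits = sum(ch.isdigit() for ch in name)
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "digit_ratio": digits / len(name) if name else 0.0,
    }


# Compare a benign-looking and a random-looking domain from the dataset preview.
for example in ["google.com", "nlgusntqeqixnqyo.org"]:
    print(example, domain_features(example))
```

Applied to the full DataFrame, a call such as `df['Domain'].apply(domain_features).apply(pd.Series)` would turn these dictionaries into numeric columns that the classifiers can consume.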

In [ ]: