The completion of Portfolio Task 2: Conduct an investigation on a URL database to develop a DGA classification system using machine learning techniques is worth 20% towards your portfolio for the UFCFFY-15-M Cyber Security Analytics (CSA) module. Please refer to your Assignment Overview for full details.
For this task, you will be provided with a URL dataset. You will need to develop a machine learning tool using Python and scikit-learn that can identify URLs generated by Domain Generation Algorithms (DGAs), which are widely used by command and control malware to avoid static IP blocking. You are expected to show how a suitable set of features can be derived from the data for developing a machine learning classifier using Python data science libraries. You should also compare the results of 3 different classifiers for your task using the scikit-learn library, and provide a confusion matrix and an accuracy score for each classifier. Your portfolio submission for this task should be an HTML export of your IPYNB Jupyter notebook that details your investigation, using appropriate code cells to perform the required analysis and Markdown cells to explain your work. You should include the following 3 classifiers in your study:
clf = LogisticRegression(random_state=42)
clf = RandomForestClassifier(max_depth=100, random_state=42)
clf = MLPClassifier(random_state=42, max_iter=300)
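One possible way to organise the comparison is to loop over the three classifiers and record an accuracy score and confusion matrix for each. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for features derived from the URL dataset, and the `classifiers` and `results` names are assumptions rather than part of the provided code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption): replace with features
# derived from the URL dataset.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The three classifiers required by the task, with the stated parameters.
classifiers = {
    "LogisticRegression": LogisticRegression(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(max_depth=100, random_state=42),
    "MLPClassifier": MLPClassifier(random_state=42, max_iter=300),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(f"{name}: accuracy = {results[name]['accuracy']:.3f}")
    print(results[name]["confusion_matrix"])
```

Keeping the same train/test split for every classifier means any difference in the scores comes from the model rather than from the data partition.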
Dataset: Please see the folder "Portfolio Assignment" under the Assignment tab on Blackboard for further detail related to the access and download of the necessary dataset.
Hint: You should conduct research using the scikit-learn documentation and API reference based on the sample code provided. You should also think about a suitable means of generating input features for your classifier that capture sequential properties of text data.
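As one illustration of the hint above, character n-grams are a common way to capture the order of characters in a string. The sketch below uses `CountVectorizer` with `analyzer="char"`; the example domains and the choice of `ngram_range=(2, 3)` are assumptions for demonstration, not values taken from the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example domains (not taken from the assignment dataset).
domains = ["google.com", "xjwqpzkt.net", "example.org"]

# analyzer="char" tokenises by character; ngram_range=(2, 3) counts every
# bigram and trigram, so local character order is reflected in the features.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(domains)

print(X.shape)                         # (number of domains, number of distinct n-grams)
print(sorted(vectorizer.vocabulary_)[:10])  # a few of the learned n-grams
```

The result is a sparse matrix that can be passed directly to the `fit` method of a scikit-learn classifier. Other n-gram ranges or a TF-IDF weighting may work better and are worth comparing.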
|Suitable data pre-processing stages (25%)||No evidence of progress||A limited attempt to address this criterion||Some fair attempt at data pre-processing but perhaps some flaws||Good approach to the problem with minor flaws||Very good approach to the problem||Excellent approach to the problem|
|Suitable usage of machine learning library (25%)||No evidence of progress||A limited attempt to address this criterion||A partial working solution but perhaps some flaws||Good approach to the problem with minor flaws||Very good approach to the problem with good demonstration of understanding for the ML parameters under investigation||Excellent approach to the problem with very good demonstration of understanding for the ML parameters under investigation|
|Suitable experimental design and analysis of results (25%)||No evidence of progress||A limited attempt to address this criterion||Single ML experimentation and/or analysis by only accuracy||Good experimental approach, comparison of multiple methods/multiple feature representations, and/or analysis by only accuracy||Deeper experimental approach with critique of multiple methods and feature sets, with extended analysis of results||Deeper experimental approach with critique of multiple methods and feature sets, with extended analysis of results, to a highly professional standard|
|Clarity and professional report presentation (25%)||No evidence of progress||A limited attempt to address this criterion||Some evidence of markdown commentary but with major flaws||Markdown commentary with only minor flaws||Very good detail in markdown commentary||Excellent detail in markdown commentary|
To achieve the higher end of the grade scale, you need to demonstrate creativity in how you pre-process the data, how you utilise and assess multiple ML approaches, and then discuss the outcomes of your learning system.
Your submission for this task should include:
Your final portfolio should be submitted to Blackboard by 14:00 on 12th May 2022. Your Blackboard submission should consist of the following individual files:
Please do not ZIP the files together as a single submission on Blackboard; you can submit multiple files to Blackboard.
For each criterion, please reflect on the marking rubric and indicate what grade you would expect to receive for the work that you are submitting. For your own personal development and learning, it is important to reflect on your work and to attempt to assess it carefully. Do think carefully about both the positive aspects of your work and any limitations you may have faced.
Suitable data pre-processing stages (25%): You estimate that your grade will be __.
Suitable usage of machine learning library (25%): You estimate that your grade will be __.
Suitable experimental design and analysis of results (25%): You estimate that your grade will be __.
Clarity and professional report presentation (25%): You estimate that your grade will be __.
Please provide a minimum of two sentences to comment and reflect on your own self-assessment: __. __.
Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.
# Import libraries as required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 10)
from collections import Counter
from timeit import timeit
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Load in the data set as required
df = pd.read_csv('./task2-dga/dga-24000.csv')
df
24000 rows × 2 columns
# Count how many entries exist for each malware family (plus the benign class)
df.value_counts('Family')

Family
banjori     1000
benign      1000
tinba       1000
symmi       1000
suppobox    1000
            ...
locky       1000
gameover    1000
flubot      1000
emotet      1000
virut       1000
Length: 24, dtype: int64
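Since every family has the same number of entries, a stratified split keeps that balance in both the training and test sets. The following is a minimal sketch using `LabelEncoder` and `train_test_split` on a toy DataFrame; the `Domain` column name and the toy values are assumptions, because only the `Family` column name is shown above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame. The "Domain" column name and its
# values are assumptions for illustration only.
toy = pd.DataFrame({
    "Domain": [f"d{i}.com" for i in range(12)],
    "Family": ["benign"] * 4 + ["banjori"] * 4 + ["tinba"] * 4,
})

# Map each family name to an integer label (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(toy["Family"])

# stratify=y keeps the class proportions equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    toy["Domain"], y, test_size=0.25, stratify=y, random_state=42
)

print(list(encoder.classes_))
print(len(X_train), len(X_test))
```

`encoder.inverse_transform` can later convert predicted integers back into family names, which is useful when labelling a confusion matrix.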
Carry on with the investigation based on the initial code provided above. Conclude your investigation with a summary of your findings.
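As one possible starting point for carrying on, simple hand-crafted lexical features can complement n-gram counts. The sketch below computes the length, digit ratio, and Shannon entropy of a domain string using `Counter`; the function names `shannon_entropy` and `lexical_features` and the example domains are hypothetical, and this is only one of many reasonable feature sets.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy (in bits) of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(domain):
    """Return a small dictionary of hand-crafted features for one domain."""
    return {
        "length": len(domain),
        "digit_ratio": sum(ch.isdigit() for ch in domain) / max(len(domain), 1),
        "entropy": shannon_entropy(domain),
    }


# Hypothetical example domains (not taken from the assignment dataset).
print(lexical_features("google.com"))
print(lexical_features("x7f3q9zk2p.net"))
```

Algorithmically generated names often, though not always, show higher character entropy than human-chosen ones, so features like these are worth evaluating rather than assuming.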