UFCFFY-15-M Cyber Security Analytics

Practical Lab 10: Text Analytics


In this lab, we will begin to experiment with text analytics in the context of a fake news investigation. We show how to use the TF-IDF method to identify key terms in a text corpus, and how to train a machine learning model on the resulting vectors. We also explore TF-IDF further to understand what it represents for our document set.

In [1]:
!pip install pandas numpy matplotlib seaborn wordcloud scikit-learn
#!pip install nltk

#Basic libraries
import pandas as pd 
import numpy as np 

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Miscellaneous libraries
from collections import Counter

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')
In [2]:
#reading the fake and true datasets
fake_news = pd.read_csv('./example_data/Fake.csv')
true_news = pd.read_csv('./example_data/True.csv')

print ("Fake news: ", fake_news.shape)
print ("True news: ", true_news.shape)
Fake news:  (23481, 4)
True news:  (21417, 4)
In [3]:
fake_news.head(10)
Out[3]:
title text subject date
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31, 2017
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31, 2017
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30, 2017
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29, 2017
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25, 2017
5 Racist Alabama Cops Brutalize Black Boy While... The number of cases of cops brutalizing and ki... News December 25, 2017
6 Fresh Off The Golf Course, Trump Lashes Out A... Donald Trump spent a good portion of his day a... News December 23, 2017
7 Trump Said Some INSANELY Racist Stuff Inside ... In the wake of yet another court decision that... News December 23, 2017
8 Former CIA Director Slams Trump Over UN Bully... Many people have raised the alarm regarding th... News December 22, 2017
9 WATCH: Brand-New Pro-Trump Ad Features So Muc... Just when you might have thought we d get a br... News December 21, 2017
In [4]:
true_news.head(10)
Out[4]:
title text subject date
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017
5 White House, Congress prepare for talks on spe... WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T... politicsNews December 29, 2017
6 Trump says Russia probe will be fair, but time... WEST PALM BEACH, Fla (Reuters) - President Don... politicsNews December 29, 2017
7 Factbox: Trump on Twitter (Dec 29) - Approval ... The following statements were posted to the ve... politicsNews December 29, 2017
8 Trump on Twitter (Dec 28) - Global Warming The following statements were posted to the ve... politicsNews December 29, 2017
9 Alabama official to certify Senator-elect Jone... WASHINGTON (Reuters) - Alabama Secretary of St... politicsNews December 28, 2017
In [5]:
fake_news.iloc[0]['text']
Out[5]:
'Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year!  Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you  Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress.  Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me?  Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish??  Marlene (@marlene399) December 31, 2017You can t just say happy new year?  Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love!  Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. 
He s been doing this for years.Trump has directed messages to his  enemies  and  haters  for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA  Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President?  Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down.  Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters?  Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old  Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.'

Example: TF-IDF Vectoriser

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X
Out[23]:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>
In [10]:
X.todense()
Out[10]:
matrix([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
         0.        , 0.38408524, 0.        , 0.38408524],
        [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
         0.53864762, 0.28108867, 0.        , 0.28108867],
        [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
         0.        , 0.26710379, 0.51184851, 0.26710379],
        [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
         0.        , 0.38408524, 0.        , 0.38408524]])
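Where do these numbers come from? With its defaults, scikit-learn uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies by the raw term count, and then L2-normalises each row. We can reproduce the 0.5803 score for 'first' in the first document by hand:

```python
import numpy as np

n_docs = 4
# document frequencies in the toy corpus, counted by hand
df = {'document': 3, 'first': 2, 'is': 4, 'the': 4, 'this': 4}
# smoothed IDF - the TfidfVectorizer default (smooth_idf=True)
idf = {t: np.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

# first document: 'This is the first document.' - every term occurs once,
# so the unnormalised weight of each term is simply its IDF
raw = np.array([idf[t] for t in ['document', 'first', 'is', 'the', 'this']])
weights = raw / np.linalg.norm(raw)   # L2 normalisation per row
print(round(weights[1], 8))           # score for 'first'
```

This matches the 0.58028582 entry in row 0 of the matrix above.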
In [17]:
plt.figure(figsize=(5,5))
plt.imshow(X.todense())
plt.show()
In [8]:
X.shape
Out[8]:
(4, 9)
In [9]:
vectorizer.get_feature_names_out()
Out[9]:
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)
In [22]:
plt.figure(figsize=(5,5))
plt.imshow(X.todense())
plt.xticks(list(range(X.shape[1])), labels=vectorizer.get_feature_names_out(), rotation=90)
plt.show()

Let's try this on our Fake News Dataset....

We will start with a simple approach: take 5 articles from each class and build a TF-IDF vector as above.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    fake_news.iloc[0]['text'],
    fake_news.iloc[1]['text'],
    fake_news.iloc[2]['text'],
    fake_news.iloc[3]['text'],
    fake_news.iloc[4]['text'],
    true_news.iloc[0]['text'],
    true_news.iloc[1]['text'],
    true_news.iloc[2]['text'],
    true_news.iloc[3]['text'],
    true_news.iloc[4]['text'],  
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X
Out[24]:
<10x1590 sparse matrix of type '<class 'numpy.float64'>'
	with 2650 stored elements in Compressed Sparse Row format>
In [25]:
X.todense()
Out[25]:
matrix([[0.        , 0.02440592, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.03062055,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.01480304, 0.0199038 , ..., 0.        , 0.        ,
         0.        ]])
In [47]:
plt.figure(figsize=(20,10))
start_n = 0
end_n = 160
data = X.todense()[:, start_n:end_n]
labels = vectorizer.get_feature_names_out()[start_n:end_n]
plt.imshow(data)
plt.xticks(list(range(start_n,end_n)), labels=labels, rotation=90)
plt.show()

The matrix above gives a visual representation of word occurrence across classes - note that words such as 'and' clearly appear across all documents, whilst others belong only to the fake (top 5 rows) or true (bottom 5 rows) class.

In [52]:
# set the labels manually since we have a small set of 5 fake and 5 true
y = [0,0,0,0,0,1,1,1,1,1]
In [53]:
# create a simple classifier - trained only on our 5 examples of each class
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X, y)
In [56]:
# predict the probabilities of the training examples - shows the class separation
clf.predict_proba(X)
Out[56]:
array([[0.99275323, 0.00724677],
       [0.99100087, 0.00899913],
       [0.99554437, 0.00445563],
       [0.99333115, 0.00666885],
       [0.99251297, 0.00748703],
       [0.00353847, 0.99646153],
       [0.0056744 , 0.9943256 ],
       [0.00630346, 0.99369654],
       [0.00840374, 0.99159626],
       [0.00557062, 0.99442938]])

Using the classifier on a new test dataset

We draw 5 new samples of each class to try.

In [61]:
# now let's try a new test set of data - draw some examples from our full dataset

new_corpus = [
    fake_news.iloc[10]['text'],
    fake_news.iloc[11]['text'],
    fake_news.iloc[12]['text'],
    fake_news.iloc[13]['text'],
    fake_news.iloc[14]['text'],
    true_news.iloc[10]['text'],
    true_news.iloc[11]['text'],
    true_news.iloc[12]['text'],
    true_news.iloc[13]['text'],
    true_news.iloc[14]['text'],  
]

# here we transform our corpus to the TF-IDF vector - NOTE: we transform only, we do not fit again as we are testing not training.
new_X = vectorizer.transform(new_corpus)
# here we predict probabilities of new_X using our trained clf classifier...
clf.predict_proba(new_X)
Out[61]:
array([[0.42758047, 0.57241953],
       [0.48673985, 0.51326015],
       [0.35945337, 0.64054663],
       [0.43636145, 0.56363855],
       [0.28241462, 0.71758538],
       [0.15684186, 0.84315814],
       [0.11243396, 0.88756604],
       [0.40300811, 0.59699189],
       [0.57990193, 0.42009807],
       [0.24019662, 0.75980338]])

Note that the probabilities are not as strongly weighted as they were earlier, which shows the limited capability of this initial model (it was trained on a very small dataset). Can we improve on this? Let's try training a model on our full dataset - this will take longer, but the model will learn from far more data.
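The full pipeline we are about to scale up - fit a vectoriser on training text, train a classifier, then transform (never re-fit) the test text - can be summarised in a small self-contained sketch. The texts here are hypothetical stand-ins; the lab's real data comes from Fake.csv and True.csv:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# hypothetical stand-in texts - the real lab uses full news articles
train_texts = ['shocking secret they refuse to show you',
               'you will not believe this miracle trick',
               'reuters reports the committee approved the bill',
               'the senate passed the budget measure on tuesday']
train_y = [0, 0, 1, 1]                    # 0 = fake, 1 = true

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
clf = MLPClassifier(random_state=1, max_iter=500).fit(X_train, train_y)

# transform only - never re-fit the vectoriser on test data
test_texts = ['shocking miracle trick they refuse to show',
              'reuters reports the senate approved the measure']
X_test = vec.transform(test_texts)
pred = clf.predict(X_test)
print('accuracy:', np.mean(pred == np.array([0, 1])))
```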

In [78]:
# build the full corpus and labels from both datasets

final_corpus = []
final_labels = []

for row in range(fake_news.shape[0]):
    final_corpus.append(fake_news.iloc[row]['text'])
    final_labels.append(0)
for row in range(true_news.shape[0]):
    final_corpus.append(true_news.iloc[row]['text'])
    final_labels.append(1)

new_X = vectorizer.fit_transform(final_corpus)
In [85]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_X, final_labels, test_size=0.3, random_state=100)
In [86]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
In [102]:
np.round(clf.predict_proba(X_test))[0:10]
Out[102]:
array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.]])
In [103]:
y_test[:10]
Out[103]:
[0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
In [89]:
clf.score(X_test, y_test)
Out[89]:
0.9911655530809206

Here we can inspect the predictions for the first 10 instances to assess whether they are correct. We also achieve an overall accuracy of approximately 99.1% on the held-out test set when training on the complete dataset.
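A single accuracy figure can hide class-specific errors. scikit-learn's `confusion_matrix` and `classification_report` break the result down per class - a sketch using hypothetical label arrays modelled on the first 10 test instances above (with the trained model, you would pass `y_test` and `clf.predict(X_test)`):

```python
from sklearn.metrics import confusion_matrix, classification_report

# hypothetical labels standing in for y_test and clf.predict(X_test)
y_true = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # one fake article misclassified as true

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted
print(classification_report(y_true, y_pred, target_names=['fake', 'true']))
```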

In [ ]: