In this lab, we will begin to experiment with text analytics in the context of a fake news investigation. We show how to use the TF-IDF method to identify key terms in a text corpus and how to train a machine learning model on the resulting vectors. We also explore TF-IDF further to understand what it represents for our document set.
!pip install pandas numpy matplotlib seaborn wordcloud scikit-learn
#! pip install nltk
#Basic libraries
import pandas as pd
import numpy as np
#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
#Miscellaneous libraries
from collections import Counter
#Ignore warnings
import warnings
warnings.filterwarnings('ignore')
#reading the fake and true datasets
fake_news = pd.read_csv('./example_data/Fake.csv')
true_news = pd.read_csv('./example_data/True.csv')
print ("Fake news: ", fake_news.shape)
print ("True news: ", true_news.shape)
fake_news.head(10)
true_news.head(10)
fake_news.iloc[0]['text']
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X
X.todense()
plt.figure(figsize=(5,5))
plt.imshow(X.todense())
plt.show()
X.shape
vectorizer.get_feature_names_out()
plt.figure(figsize=(5,5))
plt.imshow(X.todense())
plt.xticks(list(range(X.shape[1])), labels=vectorizer.get_feature_names_out(), rotation=90)
plt.show()
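Each cell in this heat map is the TF-IDF weight of one vocabulary term in one document. With scikit-learn's defaults the weight is roughly the term count multiplied by a smoothed inverse document frequency, ln((1 + N) / (1 + df(t))) + 1, with each document vector then L2-normalised, so terms that are frequent in one document but rare across the corpus score highest. A minimal sketch of reading off the strongest terms per document (the exact ordering depends on the weighting variant):
# list the three highest-weighted terms in each toy document; terms that occur
# in few documents (higher IDF) tend to outrank terms that appear everywhere
terms = vectorizer.get_feature_names_out()
dense = np.asarray(X.todense())
for i, row in enumerate(dense):
    top = row.argsort()[::-1][:3]
    print(f"doc {i}:", [(terms[j], round(float(row[j]), 3)) for j in top])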
We will start with a simple approach: take 5 articles from each class and build a TF-IDF matrix as above.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    fake_news.iloc[0]['text'],
    fake_news.iloc[1]['text'],
    fake_news.iloc[2]['text'],
    fake_news.iloc[3]['text'],
    fake_news.iloc[4]['text'],
    true_news.iloc[0]['text'],
    true_news.iloc[1]['text'],
    true_news.iloc[2]['text'],
    true_news.iloc[3]['text'],
    true_news.iloc[4]['text'],
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X
X.todense()
plt.figure(figsize=(20,10))
start_n = 0
end_n = 160
data = X.todense()[:, start_n:end_n]
labels = vectorizer.get_feature_names_out()[start_n:end_n]
plt.imshow(data)
plt.xticks(list(range(end_n - start_n)), labels=labels, rotation=90)
plt.show()
The matrix above gives a visual representation of how each word occurs across the two classes - note that common words such as 'and' clearly appear across all documents, whilst others occur only in the fake articles (top 5 rows) or the true articles (bottom 5 rows).
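As a quick check, we can pull out the column for a single word and see how its weight is distributed across the ten documents (rows 0-4 are fake, rows 5-9 are true). The words below are only illustrative picks; substitute any term from vectorizer.get_feature_names_out().
# inspect the TF-IDF weight of a chosen word across all ten documents
vocab = list(vectorizer.get_feature_names_out())
for word in ['and', 'the']:
    if word in vocab:
        col = vocab.index(word)
        print(word, np.round(X[:, col].toarray().ravel(), 3))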
# set the labels manually since we have a small set of 5 fake and 5 true
y = [0,0,0,0,0,1,1,1,1,1]
# create a simple classifier - trained only on our 5 examples of each class
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X, y)
# predict the probabilities of the training examples - shows the class separation
clf.predict_proba(X)
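We can also look at the hard class decisions rather than the probabilities; on only ten training documents the classifier will normally separate the two classes perfectly.
# hard predictions on the training examples, compared with the true labels
print(clf.predict(X))
print(y)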
We now draw 5 new samples from each class to test the trained classifier.
# now let's try a new test set of data - draw some examples from our full dataset
new_corpus = [
    fake_news.iloc[10]['text'],
    fake_news.iloc[11]['text'],
    fake_news.iloc[12]['text'],
    fake_news.iloc[13]['text'],
    fake_news.iloc[14]['text'],
    true_news.iloc[10]['text'],
    true_news.iloc[11]['text'],
    true_news.iloc[12]['text'],
    true_news.iloc[13]['text'],
    true_news.iloc[14]['text'],
]
# here we transform the new corpus into TF-IDF vectors - NOTE: we only transform, we do not fit again, as we are testing rather than training.
new_X = vectorizer.transform(new_corpus)
# here we predict probabilities of new_X using our trained clf classifier...
clf.predict_proba(new_X)
Note that the probabilities are not as strongly separated as they were earlier - this shows the limited capability of the initial model, which was trained on a very small dataset. Can we improve this? Let's try training a model on our full dataset (this will take longer, but it will learn from more data).
# build the full corpus and labels from every fake (label 0) and true (label 1) article
final_corpus = []
final_labels = []
for row in range(fake_news.shape[0]):
    final_corpus.append(fake_news.iloc[row]['text'])
    final_labels.append(0)
for row in range(true_news.shape[0]):
    final_corpus.append(true_news.iloc[row]['text'])
    final_labels.append(1)
new_X = vectorizer.fit_transform(final_corpus)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_X, final_labels, test_size=0.3, random_state=100)
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
np.round(clf.predict_proba(X_test))[0:10]
y_test[:10]
clf.score(X_test, y_test)
Here we can inspect the predictions for the first 10 test instances and compare them against the true labels to assess whether they are correct. We also achieve an overall accuracy of 99.116% on the held-out test set now that the model is trained on the complete dataset.
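A single accuracy figure can hide where the mistakes fall, so one optional extension is to look at the confusion matrix and the per-class precision and recall on the held-out split - a sketch:
# optional: per-class evaluation of the held-out test split
from sklearn.metrics import classification_report, confusion_matrix
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['fake', 'true']))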