In this session we will cover:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/Computer_security')
soup = BeautifulSoup(page.content, 'html.parser')
text = soup.text.replace('\n', ' ').split(" ")
# Word-level Occurrences: keep words longer than 4 characters, plot the 50 most frequent
out = [w for w in text if len(w) > 4]
plt.figure(figsize=(20,5))
pd.Series(out).value_counts().head(50).plot(kind='bar')
plt.show()
# Character-level Occurrences: count each letter a-z, plotted in alphabetical order
s = soup.text.lower().replace('\n', ' ')
out = [c for c in s if c in 'abcdefghijklmnopqrstuvwxyz']
plt.figure(figsize=(20,5))
pd.Series(out).value_counts().sort_index().plot(kind='bar')
plt.show()
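TF-IDF (term frequency-inverse document frequency) - accounts for how often a term occurs in a document, whilst weighting against how common that term is across the whole document set. TF-IDF is calculated as TF * IDF, where TF is how often the term occurs in a given document and IDF is, in the common formulation, the logarithm of the total number of documents divided by the number of documents that contain the term.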
Example: “Heartbleed” and “Cyber”
Here, Heartbleed is scored higher than Cyber because it is the rarer term across the overall document set, so its occurrences are deemed of greater significance.
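The effect can be reproduced with a minimal sketch. The three-document corpus below is hypothetical and made up purely for illustration, and the helper names `tf`, `idf` and `tf_idf` are our own. The sketch assumes the plain, unsmoothed formulation (TF as count divided by document length, IDF as log(N / documents containing the term)); library implementations often apply smoothing, so their exact values may differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "cyber" appears in every document,
# "heartbleed" appears in only the first one.
docs = [
    "cyber attack exploits heartbleed bug heartbleed patch",
    "cyber security policy review",
    "cyber crime report",
]
tokenised = [d.split() for d in docs]

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, tokens, corpus):
    """TF-IDF score of a term for one document within the corpus."""
    return tf(term, tokens) * idf(term, corpus)

# Score both terms against the first document.
scores = {t: tf_idf(t, tokenised[0], tokenised) for t in ("heartbleed", "cyber")}
print(scores)
```

Because "cyber" occurs in all three documents, its IDF is log(3/3) = 0 and its score vanishes, while "heartbleed" occurs in only one document and keeps a positive score.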