Workshop: Social Media, Data Analysis, & Cartography, WS 2022/23
Madalina Gugulica, Alexander Dunkel, Institute of Cartography, TU Dresden
This is the fourth notebook in a series of four notebooks:
Open these notebooks through the file explorer on the left side.
The task in this notebook is very similar to the one presented in the previous notebook, Part3 - Tag Maps Clustering and Topic Heat Maps: to explore latent topics in LBSM data and to visualize the textual information and its spatial distribution for a given topic. The emphasis in this notebook lies, however, on data exploration and analysis rather than on data visualization, and the focus is on a method for classifying social media posts based on their textual metadata.
Important:
The content social media users share on different platforms is extremely diverse, encompassing a very wide range of topics and including valuable information related to the way people perceive, relate to, and use different environments. In order to harness these large volumes of data, specific tools and techniques are needed to organize, search, and understand these vast quantities of information.
Text classification is a Natural Language Processing task that aims at mapping documents (in our case social media posts) into a set of predefined categories. Supervised machine learning classifiers have shown great success in performing these tasks. Nevertheless, they require large volumes of labeled data for training, which are generally not available for social media data and can be very time-consuming and expensive to obtain.
This notebook introduces a practical and unsupervised approach (which requires no labeled data) to thematically classify social media posts into specific categories (topics) simply described by a label. The underlying assumption of this approach is that Word Embeddings can be used to classify documents when no labeled training data is available.
The method is based on the comparison of the textual semantic similarity between the most relevant words in each social media post and a list of keywords for each targeted category, reflecting its semantic field (in linguistics, a semantic field is a set of words grouped by meaning that refers to a specific subject). The strength of this approach lies in its simplicity; however, its success depends on a good definition of each topic, as reflected in the list of keywords.
How do we make machines understand text data? Machines are very good at dealing with numerical data, but their performance decreases if they are fed raw text.
The idea is to create numerical representations of words that capture their meanings, semantic relationships, and the different contexts they are used in. For the conversion of raw text into numbers there are a few options. The simplest approach is to create a word frequency matrix that simply counts the occurrence of each word (bag-of-words). An enhanced version of this method estimates the log-scaled frequency of each word while considering its occurrence in all documents (tf-idf). Nevertheless, these methods capture solely the frequencies of words and no contextual information or higher-level semantics of the text.
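As a minimal sketch of these two baselines (using scikit-learn, which is available in the workshop environment; the toy sentences are invented for this example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "concert in the park tonight",
    "street food festival in the park",
    "quiet morning walk in the park",
]

# bag-of-words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(sorted(bow.vocabulary_))

# tf-idf: term counts reweighted by (log-scaled) inverse document frequency
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))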
A recent advance in the field of Natural Language Processing proposed the use of word embeddings for the numerical representation of text.
Word embeddings are a type of word representation that allows words with similar meaning to have similar representations. Behind the approach of learning these representations lies the “distributional hypothesis” by Zellig Harris, a linguistic theory that can be summarized as: words that have similar contexts will have similar meanings.
Word2Vec embedding approach
Word2Vec is one of the most popular word embedding models. It was introduced in two papers between September and October 2013 by a team of researchers at Google and is widely considered state of the art. The Word2Vec approach uses neural network based techniques to convert words into corresponding vectors in such a way that semantically similar vectors are close to each other in N-dimensional vector space, where N refers to the dimensionality of the vectors. As words that have similar neighbouring words are likely semantically similar, the word2vec approach is very good at capturing semantic relationships.
Word2Vec's ability to capture and maintain semantic relationships is reflected by a famous classic example where if you have a vector for the word "King" and you remove the vector represented by the word "Man" from the "King" and add "Woman" to it, you get a vector which is close to the "Queen" vector. This relation is commonly represented as: King - Man + Woman = Queen.
Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for applications such as text classification.
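As an illustration that is not part of this notebook's workflow: with a general-purpose pretrained model obtained through gensim's downloader API (the model name below is just an example), the analogy and similarity queries look like this:

import gensim.downloader as api

# example pretrained embeddings (assumes internet access; any KeyedVectors work the same way)
vectors = api.load("glove-wiki-gigaword-50")

# King - Man + Woman ≈ Queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# pairwise semantic similarity
print(vectors.similarity("concert", "festival"))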
For more information on Word Embeddings this could be a starting reading material: https://ruder.io/word-embeddings-1/index.html
The method adopted computes the semantic similarity between different words or groups of words and determines which words are semantically related to each other and belong to the same semantic field. Furthermore, computing the distance in the vector space (cosine distance) between the centroid of the word vectors that belong to a certain topic (semantic field) and the centroid of the word vectors that compose a social media post allows us to verify whether the textual metadata associated with the post is related to a specific category (binary classification).
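The core of this idea can be sketched with plain numpy and scipy: average the word vectors of the topic keywords and of the post tokens, then compute the cosine distance between the two centroids (the tiny 3-dimensional vectors below are made up purely for illustration):

import numpy as np
from scipy.spatial.distance import cosine

# toy word vectors; in practice these come from the word2vec model
topic_word_vectors = np.array([
    [0.9, 0.1, 0.0],   # e.g. "music"
    [0.8, 0.2, 0.1]])  # e.g. "concert"
post_word_vectors = np.array([
    [0.7, 0.3, 0.0],   # e.g. "gig"
    [0.1, 0.9, 0.2]])  # e.g. "friday"

topic_centroid = topic_word_vectors.mean(axis=0)
post_centroid = post_word_vectors.mean(axis=0)

# cosine distance = 1 - cosine similarity; smaller values mean the post is more related to the topic
print(cosine(topic_centroid, post_centroid))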
In this notebook we will implement the Word2Vec word embedding technique used for creating word vectors with Python's Gensim library.
Load Dependencies
import pandas as pd
import pickle
import scipy.spatial
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output, Markdown, display
We're creating several output graphics and temporary files.
These will be stored in the subfolder notebooks/out/.
from pathlib import Path
OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)
Parameters were chosen according to the semantic similarity performance results reported in Efficient Estimation of Word Representations in Vector Space - T. Mikolov et al. (2013).
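For reference, training such a model with gensim could look roughly like the sketch below. The corpus and parameter values are illustrative assumptions, not necessarily those used for the pretrained model loaded further down (note that gensim 4 renames the size parameter to vector_size):

from gensim.models import Word2Vec

# illustrative corpus: an iterable of token lists (in practice, the normalized posts)
sentences = [
    ["music", "festival", "neustadt"],
    ["concert", "tonight", "park"],
]

# illustrative parameters: skip-gram, 300 dimensions, context window of 5;
# min_count is set to 1 only because the toy corpus is tiny
model = Word2Vec(sentences, size=300, window=5, min_count=1, sg=1, workers=4)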
%load_ext autoreload
%autoreload 2
Prepare paths..
import sys
INPUT = Path.cwd() / "input"
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
sys.path.append(module_path)
from modules import tools
source = "topic_data.zip"
Download sample data. This may take some time.
%%time
sample_url = tools.get_sample_url()
zip_uri = f'{sample_url}/download?path=%2F&files='
tools.get_zip_extract(
uri=zip_uri,
filename=source,
output_path=INPUT,
write_intermediate=True)
Load the pretrained word2vec model using gensim's Word2Vec
from gensim import utils
from gensim.models import Word2Vec
model_w2v = Word2Vec.load(
str(INPUT / "word2vec.model"))
For the creation of the post embeddings we will need idf-score weights, which were prepared beforehand and stored in the input folder as a serialized (pickled) dictionary.
# idf-scores dictionary deserialization
with open(INPUT / 'idf_scores_dict.pkl', 'rb') as handle:
    idf_scores_dict = pickle.load(handle)
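For reference, an idf dictionary of this kind could be derived from the tokenized posts roughly as follows (a sketch; the toy corpus and the exact idf formula are assumptions, and the precomputed dictionary may have been built differently):

import math

# toy tokenized posts (in practice: the normalized Instagram posts)
tokenized_posts = [
    ["music", "festival", "neustadt"],
    ["park", "walk", "morning"],
    ["music", "concert", "park"],
]

n_docs = len(tokenized_posts)
doc_freq = {}
for tokens in tokenized_posts:
    for token in set(tokens):  # count each token at most once per document
        doc_freq[token] = doc_freq.get(token, 0) + 1

# idf: log-scaled inverse document frequency
idf_scores = {token: math.log(n_docs / df) for token, df in doc_freq.items()}
print(idf_scores)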
Functions to compute the average topic and post vectors
def avg_topic_vector(lang_model, tokens_list):
    """Average the word vectors of all in-vocabulary keywords of a topic."""
    # remove out-of-vocabulary words
    tokens = []
    for token in tokens_list:
        if token in lang_model.wv.vocab:
            tokens.append(token)
    return np.average(lang_model[tokens], axis=0)

def avg_post_vector(lang_model, tokens_list, idf):
    """Average the word vectors of a post, weighted by the tf-idf score of each word."""
    # remove out-of-vocabulary words
    tokens = []
    weights = []
    for token in tokens_list:
        if token in lang_model.wv.vocab:
            tokens.append(token)
            tf = tokens_list.count(token)
            tfidf = tf * idf[token]
            weights.append(tfidf)
    return np.average(lang_model[tokens], weights=weights, axis=0)

def has_vector_representation(lang_model, upl):
    """Check if at least one word of the document is in the
    word2vec vocabulary."""
    n = len([w for w in upl if w in lang_model.wv.vocab])
    return n > 0
The textual content of social media data has a low degree of formal semantic and syntactic accuracy. In order to provide only significant information for the text classification task, the text (post_title, post_body and tags) was normalized beforehand; the preprocessed dataset is loaded below.
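Purely for illustration, typical normalization steps for such short texts might look like the sketch below (the function and the exact steps are assumptions; the pipeline actually used to produce the normalized dataset may differ):

import re

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase, strip URLs and punctuation, drop very short tokens."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)        # remove URLs
    text = re.sub(r"[^a-zäöüß\s]", " ", text)   # keep letters only
    tokens = [t for t in text.split() if len(t) > 2]
    return " ".join(tokens)

print(normalize("Tolles Konzert heute! https://example.com #Dresden"))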
filename = "DD_Neustadt_NormalizedInstagramPosts.pickle"
df = pd.read_pickle(INPUT / filename)
print(len(df))
df.head()
The classification of the social media posts is based on the calculation of a similarity score (cosine similarity) between the topic embedding and the post embeddings and follows the workflow outlined below:
for each label (topic) a list of relevant keywords is defined and enhanced by seeking further semantically similar words through the identification of the most similar word vectors (which are closely located in the vector space)
a topic embedding will be created by averaging all the vectors representing the keywords in the list previously defined
the vector representation of each social media post is created by averaging the weighted word embeddings of all words, where the weight of a word is given by its tf-idf score.
topic_list = ['event','music','festival','concert']
enhanced_list = []
for keyword in topic_list:
    similar_words = model_w2v.wv.most_similar(
        positive=[keyword], topn=50)
    enhanced_list += [w[0] for w in similar_words]
Some words might repeat; therefore, we will save the list as a set of unique strings.
topic_list = topic_list + enhanced_list
topic_list = set(topic_list)
To create the topic embedding, call the previously defined avg_topic_vector function.
topic_embedding = avg_topic_vector(model_w2v, topic_list)
To visualize the enhanced list of keywords representative for the chosen topic and used for the calculation of the topic embedding, we use the WordCloud library.
Note: Some of the tags identified might refer to the city of Heidelberg since the word2vec model was trained on social media posts that were published within both Dresden and Heidelberg.
import matplotlib.pyplot as plt
from wordcloud import WordCloud
words = ' '.join(topic_list)
wordcloud = WordCloud(background_color="white").generate(words)
# Display the generated image:
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
%%time
df = df.reindex(df.columns.tolist() + ['classification', 'cos_dist'], axis=1)
x = 0
total_records = len(df)
for index, row in df.iterrows():
    x += 1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). ')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
    text = row['post_text'].split(' ')
    if has_vector_representation(model_w2v, text):
        # create the post embedding
        post_embedding = avg_post_vector(model_w2v, text, idf_scores_dict)
        # cosine distance between topic and post embedding (smaller = more related)
        cos_dist = scipy.spatial.distance.cosine(topic_embedding, post_embedding, w=None)
        if cos_dist < 0.3:
            df.at[index, 'classification'] = 1
        else:
            df.at[index, 'classification'] = 0
        df.at[index, 'cos_dist'] = cos_dist
# final status
clear_output(wait=True)
print(msg_text)
df.to_pickle(OUTPUT/ 'DD_Neustadt_ClassifiedInstagramPosts.pickle')
df_classified = df[df['classification'] == 1]
print ("The algorithm identified", len(df_classified), "social media posts related to music events in Dresden Neustadt")
Load dependencies
import geopandas as gp
import holoviews as hv
import geoviews as gv
from cartopy import crs as ccrs
hv.notebook_extension('bokeh')
Convert the pandas dataframe into a geopandas dataframe
df_classified is a subset of the original dataframe and its index values correspond to those in the original dataframe. We will reset the index values so that the first record of the subset gets the index 0.
df_classified = df_classified.reset_index()
gdf = gp.GeoDataFrame(
df_classified, geometry=gp.points_from_xy(df_classified.longitude, df_classified.latitude))
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project
Have a look at the geodataframe
gdf.head()
x = gdf.loc[gdf.first_valid_index()].geometry.x
y = gdf.loc[gdf.first_valid_index()].geometry.y
margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
gdf.loc[0]
is the loc-indexer from pandas. It means: access the first record of the (Geo)DataFrame
.geometry.x
is used to access the (projected) x coordinate of the geometry (point). This is only available for a GeoDataFrame (geopandas).

posts_layer = gv.Points(
    df_classified,
    kdims=['longitude', 'latitude'],
    vdims=['post_text'],
    label='Instagram Post')
from bokeh.models import HoverTool
from typing import List

def get_custom_tooltips(items: List[str]) -> str:
    """Compile HoverTool tooltip formatting with items to show on hover"""
    tooltips = ""
    if items:
        tooltips = "".join(
            f'<div><span style="font-size: 12px;">'
            f'<span style="color: #82C3EA;">{item}:</span> '
            f'@{item}'
            f'</span></div>' for item in items)
    return tooltips
def set_active_tool(plot, element):
    """Enable wheel_zoom in bokeh plot by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]
# prepare custom HoverTool
tooltips = get_custom_tooltips(items=['post_text'])
hover = HoverTool(tooltips=tooltips)
gv_layers = hv.Overlay(
    gv.tile_sources.CartoDark * \
    posts_layer.opts(
        tools=[hover],
        size=8,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='#ccff00')
    )
Store map as static HTML file
gv_layers.opts(
    projection=ccrs.GOOGLE_MERCATOR,
    title="Music Festivals and Concerts in Dresden Neustadt according to Instagram Posts",
    responsive=True,
    xlim=(bbox_bottomleft[0], bbox_topright[0]),
    ylim=(bbox_bottomleft[1], bbox_topright[1]),
    data_aspect=0.45,  # maintain fixed aspect ratio during responsive resize
    hooks=[set_active_tool])
hv.save(
    gv_layers, OUTPUT / 'topic_map.html', backend='bokeh')
Display in-line view of the map:
gv_layers.opts(
width=800,
height=480,
responsive=False,
hooks=[set_active_tool],
title= "Music Festivals and Concerts in Dresden Neustadt according to Instagram Posts" ,
projection=ccrs.GOOGLE_MERCATOR,
data_aspect=1,
xlim=(bbox_bottomleft[0], bbox_topright[0]),
ylim=(bbox_bottomleft[1], bbox_topright[1])
)
!jupyter nbconvert --to html \
--output-dir=./out/ ./04_topic_classification.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-
Clean up input folder
tools.clean_folders(
[Path.cwd() / "input"])
Text classification is a complex NLP task, and a detailed description of it is beyond the scope of this notebook. This notebook is only a very brief introduction to the main NLP concepts, methods and tools utilised in topic-based text classification.
The method presented was developed for classifying short texts such as social media posts, which in general raises a series of issues related to their unstructured and noisy nature.
The performance evaluation of the classification revealed decent results (F1-Score > 0.6). However, as observed during the performance evaluation process, misclassification occurs mainly due to two general open issues in text mining: polysemy and synonymy. Therefore, to improve the performance of the classifier, word sense disambiguation methods need to be implemented in the algorithm; this is part of future work.
root_packages = [
'python', 'geoviews', 'holoviews', 'ipywidgets', 'geopandas',
'shapely',
'matplotlib', 'sklearn', 'numpy', 'pandas', 'bokeh', 'gensim', 'wordcloud']
tools.package_report(root_packages)