Part 3: Tag Maps Clustering and Topic Heat Maps


Workshop: Social Media, Data Analysis, & Cartography, WS 2023/24

Alexander Dunkel
Leibniz Institute of Ecological Urban and Regional Development, Transformative Capacities & Research Data Centre & Technische Universität Dresden, Institute of Cartography

•••

Last updated: Feb-12-2024, Carto-Lab Docker Version 0.12.3

This is the third notebook in a series of four notebooks:

  1. Introduction to Social Media data, jupyter and python spatial visualizations
  2. Introduction to privacy issues with Social Media data and possible solutions for cartographers
  3. Specific visualization techniques example: TagMaps clustering
  4. Specific data analysis: Topic Classification

Open these notebooks through the file explorer on the left side.

Introduction: Privacy & Social Media

The task in this notebook is to extract and visualize common areas or regions of interest for a given topic from Location Based Social Media data (LBSM).

On Social Media, people react to many things, and a significant share of these reactions can be seen as forms of valuation or attribution of meaning to the physical environment.

Challenges

However, for visualizing such values for whole cities, there are several challenges involved:

  • Identify reactions that are related to a given topic:
    • Topics can be considered at different levels of granularity
    • and identical terms may convey different meanings in different contexts.
  • Social Media data is distributed very irregularly.
  • Data peaks in urban areas and highly frequented tourist hotspots,
  • while some less frequented areas do not feature any data at all.
  • When visualizing spatial reactions to topics, this unequal distribution must be taken into account; otherwise, maps will always be biased towards areas with a higher density of information.
  • For this reason, normalizing results is crucial.
  • The usual approach to normalization in such cases is to evaluate the local distribution against the global (e.g. spatial) distribution of the data (a minimal sketch of this idea follows below).
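
A minimal numpy sketch of this normalization idea (illustrative values only; the actual workflow later in this notebook subtracts a weighted global KDE surface from the topic surface):

import numpy as np

# illustrative only: densities per grid cell for a topic (local)
# and for all posts (global); the numbers are made up
local_density = np.array([5., 50., 200., 10.])
global_density = np.array([20., 400., 600., 15.])

# rescale both surfaces to a comparable range (here 0..1)
local_n = local_density / local_density.max()
global_n = global_density / global_density.max()

# subtract a weighted share of the global signal
# (the same idea as the norm_val parameter used later in this notebook)
norm_val = 0.5
normalized = (local_n - norm_val * global_n).clip(min=0)
print(normalized)
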
Tag Maps Package
  • The Tag Maps package was developed to cluster tags according to their spatial area of use
  • Currently, the final visualization step of Tag Maps clustering (placing labels etc.) is only implemented in ArcMap
  • Herein, we instead explore cluster results using Heat Maps and other graphics in python
  • In the second part of the workshop, it will be shown how to visualize the actual Tag Maps
Where is my custom area/tag map?
  • Below, you can choose between different regions/areas to explore in this notebook, provided based on student input
  • The final Tag Map visualization step will be done in the second part of the workshop
  • This will be done in ArcMap. However, it is possible to visualize Tag Maps in JupyterLab, too. The notebook is not yet ready - have a look at the work in progress here

Preparations

Define output directory

In [2]:
from pathlib import Path
OUTPUT = Path.cwd() / "out"

Temporary fix to prevent proj-path warning:

In [3]:
import sys, os
os.environ["PROJ_LIB"] = str(Path(sys.executable).parents[1] / 'share' / 'proj')

Create 01_Input folder, which will be used for loading data.

In [4]:
INPUT = Path.cwd() / "01_Input"

Load the TagMaps package, which will serve as a base for filtering, cleaning, and processing noisy social media data.

In [5]:
from tagmaps import TagMaps, EMOJI, TAGS, TOPICS, LoadData, BaseConfig
from tagmaps.classes.shared_structure import ItemCounter
In [6]:
%load_ext autoreload
%autoreload 2

Import tools used across different notebooks

In [7]:
import sys

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import preparations
preparations.init_packages()
from modules import tools

Load Data & Plot Overview

Retrieve sample LBSN CSV data:

In [8]:
source = "GrosserGarten.zip"
# source = "Istanbul.zip"
# source = "Lindau.zip"
# source = "tumunich.zip"
# source = "ArgentinaCarhue.zip"
# source = "MachuPicchu.zip"
# source = "Rudolf-Harbig-Stadion.zip"
# source = "DE_Tempelhof"
# source = "SaechsischeSchweiz.zip"
# source = "Zaragoza.zip"
# source = "DresdenNeustadt.zip"
# source = "Pamplona.zip"
In [9]:
%%time
# clean any data first
tools.clean_folders([INPUT])
sample_url = tools.get_sample_url()
lbsn_dd_csv_uri = f'{sample_url}/download?path=%2F&files='
tools.get_zip_extract(
    uri=lbsn_dd_csv_uri,
    filename=source,
    output_path=INPUT,
    write_intermediate=True)
Loaded 35.86 MB of 35.87 (100%)..
Extracting zip..
Retrieved GrosserGarten.zip, extracted size: 100.03 MB
CPU times: user 1.71 s, sys: 625 ms, total: 2.34 s
Wall time: 8.39 s

Initialize tag maps from BaseConfig

In [10]:
tm_cfg = BaseConfig()
Folder 02_Output/ was created
Loaded 337 stoplist items.
Loaded 214 inStr stoplist items.
Loaded 4 stoplist places.
Loaded 3 place lat/lng corrections.

Optionally, filter data per origin or adjust the number of top terms to extract:

In [11]:
tm_cfg.filter_origin = None
tm_cfg.max_items = 3000
In [12]:
tm_opts = {
    'tag_cluster':tm_cfg.cluster_tags,
    'emoji_cluster':tm_cfg.cluster_emoji,
    'location_cluster':tm_cfg.cluster_locations,
    'output_folder':tm_cfg.output_folder,
    'remove_long_tail':tm_cfg.remove_long_tail,
    'limit_bottom_user_count':tm_cfg.limit_bottom_user_count,
    'max_items':tm_cfg.max_items,
    'topic_cluster':True}
tm = TagMaps(**tm_opts)

a) Read from original data

Read input records from csv

In [13]:
from IPython.display import clear_output
In [14]:
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
tm.prepare_data()
Loaded 244000 records
Cleaned input to 18229 distinct locations from 244942 posts (File 2 of 2) - Skipped posts: 0 - skipped tags: 146435 of 1596294

Total post count (PC): 244942
Total tag count (PTC): 1596294
Total emoji count (PEC): 168177
Long tail removal: Filtered 336 Emoji that were used by less than 2 users.
Long tail removal: Filtered 157643 Topics that were used by less than 5 users.
CPU times: user 1min 3s, sys: 1.21 s, total: 1min 4s
Wall time: 1min 4s

b) Optional: Write (& Read) cleaned output to file

Output cleaned data to 02_Output/Output_cleaned.csv;
terms (post_body and tags) are cleaned based on the top (e.g. 1000) hashtags found in the dataset.

In [15]:
tm.lbsn_data.write_cleaned_data(panon=True)
Writing cleaned intermediate data to file (Output_cleaned.csv)..
 done.

Have a look at the output file.

In [16]:
import pandas as pd

file_path = Path.cwd() / "02_Output" / "Output_cleaned.csv"
display(pd.read_csv(file_path, nrows=5).head())
print(f'{file_path.stat().st_size / (1024*1024):.02f} MB')
origin_id lat lng guid user_guid loc_id post_create_date post_publish_date post_body hashtags emoji post_views_count post_like_count loc_name
0 1 51.033300 13.733300 3722a733575bfbd6c173dc2deadb795b 3722a733575bfbd6c173dc2deadb795b 51.0333:13.7333 NaN NaN fürstenzug NaN NaN 704 0 Fürstenzug
1 1 51.033300 13.733300 ac28e6f721e25624d76a9eb3afc61b4f ac28e6f721e25624d76a9eb3afc61b4f 51.0333:13.7333 NaN NaN NaN NaN NaN 194 0 Hochschulstraße
2 1 51.033300 13.733300 d71246d019402566900f0d59d0c9fbef d71246d019402566900f0d59d0c9fbef 51.0333:13.7333 NaN NaN stadtfest;dresdner NaN NaN 181 0 Dresdner Stadtfest
3 1 51.031945 13.730668 0ff8be1630ab8decae97efecfda9f5f3 0ff8be1630ab8decae97efecfda9f5f3 51.031945353:13.7306680064 NaN NaN club NaN NaN 228 0 Club 11
4 1 51.031224 13.730091 aab11415bd4546adef5f513ac00edc90 aab11415bd4546adef5f513ac00edc90 51.0312241:13.7300907 NaN NaN international;house NaN NaN 66 0 International Guest House
46.02 MB

Read from pre-filtered data

Read data from the (already prepared and filtered) cleaned output

In [17]:
%%time
tm.prepare_data(
    input_path=file_path)
CPU times: user 4.13 ms, sys: 2 µs, total: 4.13 ms
Wall time: 4.14 ms
In [18]:
tm.global_stats_report()
tm.item_stats_report()
Total user count (UC): 103684
Total user post locations (UPL): 168815
Number of locations with names: 2294
Total distinct tags (DTC): 174260
Total distinct emoji (DEC): 2003
Total distinct locations (DLC): 18229
Total tag count for the 3000 most used tags in selected area: 1778500.
Total emoji count for the 1667 most used emoji in selected area: 167809.
Bounds are: Min 13.727952 51.0257536334 Max 13.78818 51.051867

Topic selection & Tag Clustering

The next goal is to select reactions to given topics. TagMaps allows selecting posts by four different dimensions:

  • TAGS (i.e. single terms)
  • EMOJI (i.e. single emoji)
  • LOCATIONS (i.e. named POIs or coordinate pairs)
  • TOPICS (i.e. list of terms)

Set basic plot to notebook mode and disable interactive plots for now

In [19]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['savefig.dpi'] = 120
mpl.rcParams['figure.dpi'] = 120

We can retrieve preview maps for the TOPIC dimension
by supplying a list of terms.

For example "park", "green" and "grass" should
give us an overview where such terms are used
in our sample area.

In [20]:
nature_terms = ["park", "grass", "nature"]
fig = tm.get_selection_map(
    TOPICS, nature_terms)
In [21]:
urban_terms = ["strasse", "city", "shopping"]
fig = tm.get_selection_map(
    TOPICS, urban_terms)

HDBSCAN Cluster

We can visualize clusters for the selected topic using HDBSCAN.

The important parameter for HDBSCAN is the cluster distance,
which is chosen automatically by Tag Maps given the current scale/extent of analysis.

In [22]:
# tm.clusterer[TOPICS].cluster_distance = 150
fig = tm.get_cluster_map(
    TOPICS, nature_terms)

We can get a map of clusters and cluster shapes (convex and concave hulls).

In [23]:
fig = tm.get_cluster_shapes_map(
    TOPICS, nature_terms)
--> 31 cluster.

Behind the scenes, Tag Maps utilizes the Single Linkage Tree from HDBSCAN to cut clusters at a specified distance. This tree shows the hierarchical structure for our topic and its spatial properties in the given area.

In [24]:
fig = tm.get_singlelinkagetree_preview(
    TOPICS, nature_terms)
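
To see what cutting a single linkage tree at a fixed distance means, here is a standalone sketch using the hdbscan library directly on synthetic points (an illustration of the underlying idea, not the TagMaps internals; it assumes the hdbscan package, a TagMaps dependency, is importable):

import numpy as np
import hdbscan

# synthetic 2D points (in metres): two dense blobs plus scattered noise
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0, 0), scale=50, size=(100, 2))
blob_b = rng.normal(loc=(1000, 1000), scale=50, size=(100, 2))
noise = rng.uniform(low=-500, high=1500, size=(15, 2))
points_xy = np.vstack([blob_a, blob_b, noise])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(points_xy)

# cut the single linkage tree at a fixed distance (here 200 m),
# analogous to the cluster_distance used by Tag Maps
labels = clusterer.single_linkage_tree_.get_clusters(
    cut_distance=200, min_cluster_size=10)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f'{n_clusters} clusters, {int((labels == -1).sum())} noise points')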

Cluster centroids

Similarly, we can retrieve centroids of clusters. This shows again the unequal distribution of data:

In [25]:
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(
    nature_terms, single_clusters=True)
(1 of 16617) Found 2928 posts (UPL) for Topic 'park-grass-nature' (found in 6% of DLC in area) --> 31 cluster.

Heat Maps

  • Visualization of clustered tagmaps data is possible in several ways.
  • In the second workshop (end of January), we will use ArcMap to create labeled maps from the clustered data.
  • The last part of this notebook uses Kernel Density Estimation (KDE) to create a Heat Map for selected topics.

Load additional dependencies

In [26]:
import numpy as np
from sklearn.neighbors import KernelDensity

Load Flickr data only

Instagram, Facebook, and Twitter data is based on place accuracy (rather than exact coordinates), which is unsuitable for the Heat Map graphic.

We'll work with Flickr data only for the Heat Map.

  • 1 = Instagram
  • 2 = Flickr
  • 3 = Twitter

Reload data, filtering only Flickr

In [27]:
tm_cfg.filter_origin = "2"
tm = TagMaps(**tm_opts)
In [28]:
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
Loaded 31000 records
Cleaned input to 15951 distinct locations from 31309 posts (File 2 of 2) - Skipped posts: 213572 - skipped tags: 36969 of 232936

Total post count (PC): 31309
Total tag count (PTC): 232936
Total emoji count (PEC): 0
CPU times: user 8.11 s, sys: 503 ms, total: 8.61 s
Wall time: 8.49 s

Get Topic Coordinates

The distribution of these coordinates is what we want to visualize.

In [29]:
tm.prepare_data()
tm.init_cluster()
Long tail removal: Filtered 1959 Tags that were used by less than 5 users.
Long tail removal: Filtered 13164 Topics that were used by less than 5 users.

Topic selection

In [30]:
topic = "grass-nature-park"
points = tm.clusterer[TOPICS].get_np_points(
    item=topic,
    silent=True)

Get All Points

In [31]:
all_points = tm.clusterer[TOPICS].get_np_points()

For normalizing our final KDE raster, we'll process both the topic selection points and the global data distribution (i.e. all points in the dataset).

In [32]:
points_list = [points, all_points]
  • The input data is simply a list (a numpy.ndarray) of decimal degree coordinates
  • each entry represents a single user that published one or more posts at a specific coordinate
In [33]:
print(points[:5])
print(f'Total coordinates: {len(points)}')
[[13.766544 51.036369]
 [13.755054 51.039289]
 [13.753187 51.043785]
 [13.762543 51.037939]
 [13.753069 51.038263]]
Total coordinates: 788

Data projection

  • For faster KDE, we project data from WGS1984 (epsg:4326) to UTM
  • this allows us to calculate directly in Euclidean space.
  • The TagMaps package automatically detected the most suitable UTM coordinate system,
  • for the Großer Garten sample data, this is Zone 33N (epsg:32633)

Project lat/lng to UTM 33N (Dresden) using Transformer:

In [34]:
%%time
for idx, points in enumerate(points_list):
    points_list[idx] = np.array(
        [tm.clusterer[TOPICS].proj_transformer.transform(
                point[0], point[1]
        ) for point in points])
    
crs_wgs = tm.clusterer[TOPICS].crs_wgs
crs_proj = tm.clusterer[TOPICS].crs_proj
print(f'From {crs_wgs} \nto {crs_proj}')
print(points_list[0][:5])
From epsg:4326 
to epsg:32633
[[ 413517.95931985 5654593.12815275]
 [ 412717.86708307 5654931.37875778]
 [ 412595.43876854 5655433.54329967]
 [ 413240.37616266 5654772.41458472]
 [ 412576.773973   5654819.64201159]]
CPU times: user 143 ms, sys: 1.98 ms, total: 145 ms
Wall time: 144 ms

Calculating the Kernel Density

To summarize sklearn, a KDE is executed in two steps, training and testing:

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
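
A minimal, self-contained sketch of these two steps with sklearn's KernelDensity (synthetic coordinates, not the workshop data):

import numpy as np
from sklearn.neighbors import KernelDensity

# step 1 (training): fit the estimator on observed point coordinates
rng = np.random.default_rng(0)
train_xy = rng.normal(loc=0, scale=100, size=(500, 2))  # synthetic x/y in metres
kde_demo = KernelDensity(kernel='gaussian', bandwidth=50).fit(train_xy)

# step 2 (testing/scoring): evaluate the fitted density at new locations
sample_xy = np.array([[0., 0.], [300., 300.]])
log_density = kde_demo.score_samples(sample_xy)
print(np.exp(log_density))  # density is higher near the centre of the point cloud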

The only attributes we care about in training are lat and long.

  • Stack locations using np.vstack, extract specific columns with [:,column_id]
  • reverse order: lat, lng
  • Transpose to rows (.T), which is easier to handle in Python
In [35]:
xy_list = list()
for points in points_list:
    y = points[:, 1]
    x = points[:, 0]
    xy_list.append([x, y])
In [36]:
xy_train_list = list()
for x, y in xy_list:
    xy_train = np.vstack([y, x]).T
    xy_train_list.append(xy_train)
In [37]:
print(xy_train)
[[5655164.97172425  411138.83061996]
 [5655153.72186994  410894.39196125]
 [5655090.54346148  411243.15933809]
 ...
 [5655302.64257714  411360.89388318]
 [5654864.8466037   411463.97303812]
 [5655317.87392501  411322.44768116]]

Get bounds from total data

Access min/max decimal degree bounds object of clusterer and project to UTM33N

In [38]:
lim_lng_max = tm.clusterer[TAGS].bounds.lim_lng_max
lim_lng_min = tm.clusterer[TAGS].bounds.lim_lng_min
lim_lat_max = tm.clusterer[TAGS].bounds.lim_lat_max
lim_lat_min = tm.clusterer[TAGS].bounds.lim_lat_min

# project WGS1984 to UTM
topright = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_max, lim_lat_max)
bottomleft = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_min, lim_lat_min)

# get separate min/max for x/y
right_bound = topright[0]
left_bound = bottomleft[0]
top_bound = topright[1]
bottom_bound = bottomleft[1]

Create Sample Mesh

Create a grid of points at which to predict.

In [39]:
xbins=100j
ybins=100j
xx, yy = np.mgrid[
    left_bound:right_bound:xbins, 
    bottom_bound:top_bound:ybins]
In [40]:
xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T

Generate Training data for Kernel Density.

  • The largest effect on the final result comes from the chosen bandwidth for the KDE: smaller bandwidths mean higher resolution,
  • but may not be suitable for the given density of data (i.e. results with low validity).
  • A higher bandwidth will produce a smoother raster result, which may be too inaccurate for interpretation (a bandwidth-selection sketch follows the next cell).
In [41]:
kde_list = list()
for xy_train in xy_train_list:
    kde = KernelDensity(
        kernel='gaussian', bandwidth=200, algorithm='ball_tree')
    kde.fit(xy_train)
    kde_list.append(kde)
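
To get a feeling for this trade-off, candidate bandwidths can be compared by cross-validation before settling on a value (a sketch reusing xy_train_list[0] from above; the 200 m used in this notebook is a choice for this extent and data density, not a general recommendation):

from sklearn.model_selection import GridSearchCV

# cross-validated log-likelihood for a few candidate bandwidths (in metres)
grid = GridSearchCV(
    KernelDensity(kernel='gaussian', algorithm='ball_tree'),
    {'bandwidth': [50, 100, 200, 400, 800]},
    cv=5)
grid.fit(xy_train_list[0])
print(grid.best_params_)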

score_samples() returns the log-likelihood of the samples

In [42]:
%%time
z_list = list()
for kde in kde_list:
    z_scores = kde.score_samples(xy_sample)
    z = np.exp(z_scores)
    z_list.append(z)
CPU times: user 7.39 s, sys: 0 ns, total: 7.39 s
Wall time: 7.38 s

Remove values below zero; these are locations where the selected topic is underrepresented, given the global KDE mean.

In [43]:
for ix in range(0, 2):
    z_list[ix] = z_list[ix].clip(min=0)

Normalize z-scores to the range 1 to 1000 for comparison

In [44]:
for idx, z in enumerate(z_list):
    z_list[idx] = np.interp(
        z, (z.min(), z.max()), (1, 1000))

Subtract global z-scores from local z-scores (our topic selection)

norm_val:

  • weight of global z-score
  • lower means less normalization effect, higher means stronger normalization effect
  • range: 0 to 1
In [45]:
norm_val = 0.5
z_orig = z_list[0]
z_is = z_list[0] - (z_list[1]*norm_val)
z_results = [z_orig, z_is]

Reshape results to fit grid mesh extent

In [46]:
for idx, z_result in enumerate(z_results):
    z_results[idx] = z_results[idx].reshape(xx.shape)

Plot original and normalized meshes

In [47]:
from matplotlib.ticker import FuncFormatter

def y_formatter(y_value, __):
    """Format UTM to decimal degrees y-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            left_bound, y_value)
    return f'{xy_value[1]:3.2f}'

def x_formatter(x_value, __):
    """Format UTM to decimal degrees x-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            x_value, bottom_bound)
    return f'{xy_value[0]:3.1f}'
In [48]:
# a figure with a 1x2 grid of Axes
fig, ax_lst = plt.subplots(1, 2, figsize=(10, 3))

for idx, zz in enumerate(z_results):
    axis = ax_lst[idx]
    # Change the fontsize of minor ticks label 
    axis.tick_params(axis='both', which='major', labelsize=7)
    axis.tick_params(axis='both', which='minor', labelsize=7)
    # set plotting bounds
    axis.set_xlim(
                [left_bound, right_bound])
    axis.set_ylim(
                [bottom_bound, top_bound])
    
    # plot contours of the density
    levels = np.linspace(zz.min(), zz.max(), 15)
    # Create Contours,
    # save to CF-variable so we can later reference
    # it in colorbar hook
    CF = axis.contourf(
        xx, yy, zz, levels=levels, cmap='viridis')
    # titles for plots
    if idx == 0:
        title = f'Original KDE for topic {topic}\n(Flickr Only)'
    else:
        title = f'Normalized KDE for topic {topic}\n(Flickr Only)'
    axis.set_title(title, fontsize=11)
    axis.set_xlabel('lng', fontsize=10)
    axis.set_ylabel('lat', fontsize=10)
    # plot points on top
    axis.scatter(
        xy_list[1][0], xy_list[1][1], s=1, facecolor='white', alpha=0.05)
    axis.scatter(
        xy_list[0][0], xy_list[0][1], s=1, facecolor='white', alpha=0.1)
    # replace x, y coords with decimal degrees text (instead of UTM dist)
    axis.yaxis.set_major_formatter(
        FuncFormatter(y_formatter))
    axis.xaxis.set_major_formatter(
        FuncFormatter(x_formatter))
    if idx > 0:
        break
    # Make a colorbar for the ContourSet returned by the contourf call.
    cbar = fig.colorbar(CF,fraction=0.046, pad=0.04)
    cbar.set_title = ""
    cbar.sup_title = ""
    cbar.ax.set_ylabel(
        "Number of \n User Post Locations (UPL)", fontsize=11)
    cbar.ax.set_title('')
    cbar.ax.tick_params(axis='both', labelsize=7)
    cbar.ax.yaxis.set_tick_params(width=0.5, length=4)

Depending on the data source and chosen topic, the two maps above will differ more or less.

For the Großer Garten example and the nature topic,
the normalized map differs little from the original KDE,
likely because there are few false positives for nature reactions in the surrounding area.
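
One quick way to check this impression numerically is to correlate the two rasters (a sketch using the z_results arrays from above; correlation ignores the different value offsets of the two surfaces):

# a correlation close to 1 supports the visual impression that
# normalization changed little for this topic and area
corr = np.corrcoef(z_results[0].ravel(), z_results[1].ravel())[0, 1]
print(f'Correlation between original and normalized KDE: {corr:.3f}')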

Combine HDBSCAN and KDE

For the final map, we want to combine the different data we collected.

This means adding cluster shapes from HDBSCAN and generating
a standalone, interactive external HTML file for sharing.

In [49]:
from descartes import PolygonPatch  # import needed by get_poly_patch below

def get_poly_patch(polygon):
    """Returns a matplotlib polygon-patch from shapely polygon"""
    patch = PolygonPatch(
        polygon, fc='#000000',
        ec='#999999', fill=False,
        zorder=1, alpha=0.7)
    return patch

def add_patch(polygon, axis):  
    xs, ys = polygon.exterior.xy    
    axis.fill(xs, ys, alpha=0.5, fc='none', ec='white')
In [50]:
fig, axis = plt.subplots(1)

# Change the fontsize of minor ticks label 
axis.tick_params(
    axis='both', which='major', labelsize=7)
axis.tick_params(
    axis='both', which='minor', labelsize=7)
# set plotting bounds
axis.set_xlim(
            [left_bound, right_bound])
axis.set_ylim(
            [bottom_bound, top_bound])

# plot contours of the density
levels = np.linspace(zz.min(), zz.max(), 25)
CF = axis.contourf(
    xx, yy, zz, levels=levels, cmap='viridis')

title = (f'Normalized KDE for topic {topic} '
         f'\n with HDBSCAN Cluster Shapes added '
         f'\n(Flickr Only)')
axis.set_title(title)
axis.set_xlabel('lng', fontsize=10)
axis.set_ylabel('lat', fontsize=10)
# plot points on top
axis.scatter(
    xy_list[0][0], xy_list[0][1],
    s=1, facecolor='white', alpha=0.1)
axis.yaxis.set_major_formatter(
    FuncFormatter(y_formatter))
axis.xaxis.set_major_formatter(
    FuncFormatter(x_formatter))

# Make a colorbar for the ContourSet 
# returned by the contourf call.
cbar = fig.colorbar(
    CF,
    label='Number of \n User Post Locations (UPL)',
    fraction=0.046, pad=0.04)
cbar.set_title = ""
cbar.sup_title = ""
cbar.ax.set_title('')
cbar.ax.tick_params(axis='both', labelsize=7)
cbar.ax.yaxis.set_tick_params(width=0.5, length=4)

# Get Cluster Shapes from TagMaps
tm.clusterer[TOPICS].cluster_distance = 200
result = tm.clusterer[TOPICS].cluster_item(
    item=topic,
    preview_mode=True)
clusters = result.clusters
selected_post_guids = result.guids

cluster_guids = tm.clusterer[TOPICS]._get_cluster_guids(
    clusters, selected_post_guids)
shapes = tm.clusterer[TOPICS]._get_item_clustershapes(
    ItemCounter(topic, 0), cluster_guids.clustered)

for alpha_shape in shapes.alphashape:
    add_patch(alpha_shape.shape, axis)
--> 12 cluster.

Overlay Raster in Geoviews for interactive display

In [51]:
import holoviews as hv
import geoviews as gv
import geoviews.feature as gf
from cartopy.crs import epsg
import geoviews.tile_sources as gts
hv.notebook_extension('bokeh')

Get cartopy projection from epsg string.

In [52]:
cc_proj = epsg(
    crs_proj.split(":")[1])

Create contours from the KDE data

In [53]:
contours = gv.FilledContours(
    (xx.T, yy.T, zz.T),
    vdims='Number of \n User Post Locations (UPL)',
    crs=cc_proj)

Create additional layer for HDBSCAN cluster shapes

In [54]:
shape_list = [shape.shape for shape in shapes.alphashape]
gv_shapes = gv.Polygons(
    shape_list,
    label=f'HDBSCAN city-wide Cluster Shapes for topic {topic}',
    crs=cc_proj)

Prepare plotting

In [55]:
def set_active_tool(plot, element):
    """Enable wheel_zoom by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]

Prepare projection from WGS1984 to Web Mercator

In [56]:
from pyproj import Transformer

# source and target coordinate reference systems:
# bounds are in WGS1984 (decimal degrees), the tile map uses Web Mercator
crs_wgs = "epsg:4326"
crs_web = "epsg:3857"

# define Transformer ahead of time
# with xy-order of coordinates (Web Mercator to WGS1984)
proj_transformer = Transformer.from_crs(
    crs_web, crs_wgs, always_xy=True)

# also define reverse projection (WGS1984 to Web Mercator)
proj_transformer_back = Transformer.from_crs(
    crs_wgs, crs_web, always_xy=True)

Zoom in to 50% of original data extent

In [57]:
xcrop = (lim_lng_max - lim_lng_min) / 4
ycrop = (lim_lat_max - lim_lat_min) / 4
lim_lng_min_crop = lim_lng_min + xcrop
lim_lng_max_crop = lim_lng_max - xcrop
lim_lat_min_crop = lim_lat_min + ycrop
lim_lat_max_crop = lim_lat_max - ycrop

Apply transformation

In [58]:
gg_bottomleft = proj_transformer_back.transform(
    lim_lng_min_crop, lim_lat_min_crop)
gg_topright = proj_transformer_back.transform(
    lim_lng_max_crop, lim_lat_max_crop)

Construct final overlay from individual layers

  1. Background Tiles (satellite)
  2. All Points (Photos)
  3. KDE Contours
  4. Selected topic points ('green' photos)
  5. HDBSCAN Cluster Shapes
In [59]:
bg_layer = gv.tile_sources.EsriImagery.opts(alpha=1.0)
all_photos_layer = gv.Points(
    (xy_list[1][0], xy_list[1][1]),
    label='All Photos (white)',
    crs=cc_proj).opts(size=5, color='white', alpha=0.3)
contours_layer = contours.opts(
    colorbar=True, cmap='viridis', alpha=0.3,
    clipping_colors={'NaN': (0, 0, 0, 0)})
topic_layer = gv.Points(
    (xy_list[0][0], xy_list[0][1]),
    label='Photos for selected topic',
    crs=cc_proj).opts(size=5, color='orange', alpha=0.5)
cluster_layer = gv_shapes.opts(
    line_color='white', line_width=0.5, fill_alpha=0)

Compile final output

Have a look at the generated interactive html here

In [60]:
layer_options = {
    "responsive":True,
    "title":f"Frequentation surface for topic {topic}",
    "hooks":[set_active_tool],
    "data_aspect":0.6,
    "xlim":(gg_bottomleft[0], gg_topright[0]),
    "ylim":(gg_bottomleft[1], gg_topright[1])
}
In [61]:
gv_layers = hv.Overlay(
    bg_layer * \
    all_photos_layer * \
    contours_layer * \
    topic_layer * \
    cluster_layer)

View inline

In [62]:
gv_layers.opts(**layer_options)
Out[62]:

Unset data_aspect for automatic resize to full browser height, then export to standalone HTML:

In [63]:
layer_options["data_aspect"] = None

Save to external HTML.

In [64]:
hv.save(
    gv_layers.opts(**layer_options),
    OUTPUT / f'heatmap_freq_{topic}.html', backend='bokeh')

Export as Shapefile(s)

Data interchange is critical. Export Contours and other data used here as Shapefiles (UTM Zone 33N).

In [65]:
from datetime import date
today = str(date.today())

Contours

For the matplotlib contourf format, we use the flopy export utility.

In [67]:
import flopy
from flopy.export.utils import export_contourf
epsg_output = tm.clusterer[TOPICS].crs_proj.split(":")[1]

# write a prj file first;
# to be used in the export_contourf function
prj_file = OUTPUT / f"{epsg_output}.prj"
with open(prj_file, "w") as prj:
    epsg_wkt = tools.get_wkt_prj(epsg_output)
    prj.write(epsg_wkt)

export_contourf(
    str(OUTPUT / f'{today}_KDE.shp'), 
    CF, prj=prj_file)
Writing 60 polygons
.prj file /home/jovyan/work/jupyter_python_datascience/notebooks/out/2024-02-12_KDE.prj already exists
wrote /home/jovyan/work/jupyter_python_datascience/notebooks/out/2024-02-12_KDE.shp

Alpha Shapes

In [68]:
from typing import List, Dict, Any
def write_shapefile(
    name: Path, schema: Dict[str, str],
    properties_dict, geometries: List[Any],
    epsg_code: str):
    """Write list of geometries to shapefile"""
    with fiona.open(
        name, 'w', 'ESRI Shapefile', schema,
        crs=epsg_code) as shapefile:
        for geometry in geometries:
            shapefile.write({
                'geometry': mapping(geometry),
                'properties': properties_dict,
                })
In [69]:
import fiona
from fiona.crs import from_epsg
from shapely.geometry import mapping, Polygon, Point

# Define a polygon feature geometry with one attribute
schema = {
    'geometry': 'Polygon',
    'properties': {'id': 'str'},
    }

# Write a new Shapefile
write_shapefile(
    name=OUTPUT / f'{today}_Clusters.shp',
    schema=schema,
    properties_dict={'id': "photo cluster"},
    geometries=shape_list,
    epsg_code=from_epsg(epsg_output))

Post points

Use list comprehension to create a list of Shapely points

In [70]:
points = [Point(xx, yy) for xx, yy in zip(xy_list[1][0], xy_list[1][1])]
In [71]:
schema = {
    'geometry': 'Point',
    'properties': {'id': 'str'},
    }

write_shapefile(
    name=OUTPUT / f'{today}_AllPhotos.shp',
    schema=schema,
    properties_dict={'id': "All photos"},
    geometries=points,
    epsg_code=from_epsg(epsg_output))

Write topic points:

In [72]:
points = [Point(xx, yy) for xx, yy in zip(xy_list[0][0], xy_list[0][1])]
In [73]:
write_shapefile(
    name=OUTPUT / f'{today}_SelectedPhotos.shp',
    schema=schema,
    properties_dict={'id': "Photos for selected topic"},
    geometries=points,
    epsg_code=from_epsg(epsg_output))

Create Notebook HTML

In [ ]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./03_tagmaps.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False 

Create ZIP with all results

During the workshop, you have created several files in the out folder.

Files can be downloaded individually, but we can also zip the contents to make life easier.

In [74]:
%%time

zip_file = OUTPUT / f'{today}_workshop_results.zip'
if not zip_file.exists():
    tools.zip_dir(OUTPUT, zip_file)
CPU times: user 4.57 s, sys: 381 ms, total: 4.95 s
Wall time: 10.8 s

Download the zipfile from the out folder on the left.

Clean up input folder

In [75]:
tools.clean_folders([INPUT, Path.cwd() / "02_Output"])
Done. Thank you. Do not forget to shut down your notebook server (File > Shut Down), once you are finished with the last notebook.

Summary

In [76]:
root_packages = [
    'python', 'geoviews', 'holoviews', 'ipywidgets', 'descartes', 'hvplot',
    'flopy', 'mapclassify', 'memory_profiler', 'python-dotenv', 'shapely',
    'tagmaps', 'matplotlib', 'sklearn', 'numpy', 'pandas', 'bokeh', 'fiona']
tools.package_report(root_packages)
List of package versions used in this notebook:
python 3.9.15, Fiona 1.8.20, Shapely 1.7.1, bokeh 2.4.3, descartes 1.1.0, flopy 3.3.6, geoviews 1.9.5, holoviews 1.14.8, hvplot 0.8.3, ipywidgets 8.0.5, mapclassify 2.5.0, matplotlib 3.7.1, numpy 1.22.4, pandas 1.5.3, python-dotenv 1.0.0, tagmaps 0.21.1
In [ ]: