Alexander Dunkel, Institute of Cartography, TU Dresden
This notebook demonstrates how to load and use the HyperLogLog benchmark data shared in
Dunkel, A., Hartmann, M. C., Hauthal, E., Burghardt, D., & Purves, R. S. (2023). From sunrise to sunset: Exploring landscape preference through global reactions to ephemeral events captured in georeferenced social media. PLoS ONE, 17(1).
The data is available in a data repository (the download URL is defined as source_zip below).
To run this notebook, as a starting point, you have two options:
Clone the repository and edit your .env value to point to the directory where this notebook can be found, e.g.:
git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/sunsetsunrise-demo
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
We also use the pg-hll-empty PostgreSQL database as a backend to work with the HyperLogLog data.
Alternatively, you can use the Python package python-hll (which is included in the Carto-Lab Docker container),
but it will be slower than the native postgresql-hll implementation. To pull pg-hll-empty
and add it to the same Docker network:
git clone https://gitlab.vgiscience.de/lbsn/databases/pg-hll-empty.git
cd pg-hll-empty
cp vars.env.example vars.env
nano vars.env
# optionally adjust READONLY_USER_PASSWORD
docker-compose up -d
Load dependencies:
import os
import sys
from pathlib import Path
import psycopg2 # Postgres API
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from pyproj import Transformer, CRS, Proj
from IPython.display import clear_output, display, HTML

# make the local helper module (tools) importable;
# adjust the path below if your repository layout differs
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Define initial parameters that affect processing:
WORK_DIR = Path.cwd().parents[0] / "tmp"
"""Working diretcory"""
EPSG_CODE = 54009
CRS_PROJ = f"esri:{EPSG_CODE}"
"""Target projection: Mollweide (epsg code).
note: Mollweide defined by _esri_
in epsg.io's database"""
CRS_WGS = "epsg:4326"
"""Input projection (Web Mercator)"""
OUTPUT = Path.cwd().parents[0] / "out"
"""Define path to output directory (figures etc.)""";
WORK_DIR.mkdir(exist_ok=True)
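As a quick optional check (our addition, not part of the original workflow), pyproj should resolve the ESRI-authority Mollweide code:
# confirm that the ESRI-authority code resolves in pyproj
print(CRS(CRS_PROJ).name)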
source_zip="https://opara.zih.tu-dresden.de/xmlui/bitstream/handle/123456789/5793/S10.zip?sequence=1&isAllowed=y"
if not (WORK_DIR / "flickr_all_hll.csv").exists():
    tools.get_zip_extract(uri=source_zip, filename="S10.zip", output_path=WORK_DIR)
Show the directory tree:
tools.tree(Path.cwd().parents[0])
ALL_FLICKR = WORK_DIR / "flickr_all_hll.csv"
dtypes = {'latitude': float, 'longitude': float}
df = pd.read_csv(ALL_FLICKR, dtype=dtypes, encoding='utf-8')
df.head()
First, let's test the slower Python implementation python-hll, to get the cardinality for some of the original HLL strings. Cardinality means the estimated count of distinct items in a set, which refers to the user count in our case.
from python_hll.util import NumberUtil
from python_hll.hll import HLL
def hll_from_byte(hll_set: str) -> HLL:
    """Return HLL set from its hex string representation"""
    # strip the leading '\x' marker of the Postgres hex format
    hex_string = hll_set[2:]
    return HLL.from_bytes(
        NumberUtil.from_hex(
            hex_string, 0, len(hex_string)))
Have a look at the first HLL set:
df["user_hll"][0]
Cast to HLL and calculate cardinality in one step:
hll_from_byte(df["user_hll"][0]).cardinality() - 1
An estimated 12 distinct users have been observed at location 0.040050, -179.750586 (lat, lng).
These latitude and longitude coordinates refer to the centroids of the original 50 km grid.
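For orientation, a quick look at the size and extent of the grid (our addition):
# number of grid centroids and their coordinate ranges
print(f"{len(df)} grid centroids")
print(df[["latitude", "longitude"]].agg(["min", "max"]))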
In order to speed up processing, we can connect to a Postgres database running the faster postgresql-hll extension from Citus.
Password and username for connecting to the local hllworker database are loaded from the environment.
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "hllworkerdb"
DB_PORT = "5432"
DB_NAME = "hllworkerdb"
Connect to the empty Postgres database running the HLL extension:
DB_CONN = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER,
password=DB_PASS
)
DB_CONN.set_session(
readonly=True)
DB_CALC = tools.DbConn(
DB_CONN)
CUR_HLL = DB_CONN.cursor()
Test connection:
CUR_HLL.execute("SELECT 1;")
print(CUR_HLL.statusmessage)
Test with an actual HLL set calculation:
db_query = f"""
SELECT lat, lng, hll_cardinality(user_hll)::int as usercount
FROM (VALUES
('{df["latitude"][0]}', '{df["longitude"][0]}', '{df["user_hll"][0]}'::hll)
) s(lat, lng, user_hll)
"""
df2 = DB_CALC.query(db_query)
df2.head()
Store this as two functions, to be used later.
def union_hll(hll_sets: List[str], cur_hll = CUR_HLL) -> str:
    "Union two or more HLL sets and return the combined HLL set"
    hll_values = ", ".join(f"('{hll_set}'::hll)" for hll_set in hll_sets)
    db_query = f"""
        SELECT hll_union_agg(s.hll_set)
        FROM (
            VALUES
                {hll_values}
        ) s(hll_set);
        """
    # use the cursor passed as parameter (defaults to CUR_HLL)
    cur_hll.execute(db_query)
    # return the first column of the single result row
    return cur_hll.fetchone()[0]
def cardinality_hll(hll_set: str, cur_hll = CUR_HLL) -> int:
    "Calculate the cardinality of an HLL set and return as int"
    cur_hll.execute(f"SELECT hll_cardinality('{hll_set}')::int;")
    return int(cur_hll.fetchone()[0])
results = union_hll([df["user_hll"][0], df["user_hll"][1]])
results
cardinality_hll(results)
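As a plausibility check (our addition): the cardinality of a union should lie between the larger of the two individual counts and their sum.
# bounds check: max(|A|, |B|) <= |A ∪ B| <= |A| + |B| (approximately, for estimates)
card_a = cardinality_hll(df["user_hll"][0])
card_b = cardinality_hll(df["user_hll"][1])
print(card_a, card_b, cardinality_hll(results))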
Now that we have verified individual HLL sets, we will aggregate user counts per continent. Load the Natural Earth country geometries:
world = gp.read_file(
gp.datasets.get_path('naturalearth_lowres'),
crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)
world = world[['continent', 'geometry']]
ax = world.plot(column='continent')
ax.set_axis_off()
There are slight inaccuracies in the continent geometries (overlapping polygons), which can be fixed with a small buffer before dissolving country geometries:
world['geometry'] = world.buffer(0.01)
continents = world.dissolve(by='continent')
continents.head()
Assign the index back as a column, to allow classification by color via column='continent':
continents['continent'] = continents.index.get_level_values(0)
ax = continents.plot(column='continent')
ax.set_axis_off()
gdf = gp.GeoDataFrame(
df, geometry=gp.points_from_xy(df.longitude, df.latitude))
gdf.crs=CRS_WGS
gdf = gdf.to_crs(CRS_PROJ)
Define the pyproj Transformer ahead of time, with xy-order of coordinates, for future projections between WGS 1984 and ESRI:54009 (Mollweide):
PROJ_TRANSFORMER = Transformer.from_crs(
CRS_WGS, CRS_PROJ, always_xy=True)
# also define reverse projection
PROJ_TRANSFORMER_BACK = Transformer.from_crs(
CRS_PROJ, CRS_WGS, always_xy=True)
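For illustration (our addition), project a single centroid to Mollweide and back:
# x/y order: longitude first, latitude second (always_xy=True)
x, y = PROJ_TRANSFORMER.transform(df["longitude"][0], df["latitude"][0])
print((x, y), PROJ_TRANSFORMER_BACK.transform(x, y))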
gdf.set_index(['latitude', 'longitude'], inplace=True)
gdf.head()
Define a preview area (Italy)
bbox_eu = (
6.8662109375, 35.24427318493909,
22.31396484375, 45.29320031385282)
# create bounds from WGS1984 italy and project to Mollweide
minx, miny = PROJ_TRANSFORMER.transform(
bbox_eu[0], bbox_eu[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
bbox_eu[2], bbox_eu[3])
gdf_italy = gdf.cx[minx:maxx, miny:maxy]
%%time
cardinality_series = tools.hll_series_cardinality(gdf_italy["user_hll"], db_conn=DB_CALC)
cardinality_series.head()
gdf.loc[
cardinality_series.index,
"usercount"] = cardinality_series
Preview HLL coordinates and cardinality:
fig, ax = plt.subplots(1, 1)
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
plot_kwd = {
'column':'usercount',
'legend':True,
'markersize':25,
'cmap':'Reds',
'scheme':'quantiles',
'legend_kwds':{'bbox_to_anchor': (1, 1), 'frameon':False, 'markerscale':0.7}
}
base = gdf.cx[minx:maxx, miny:maxy].plot(
ax=ax, **plot_kwd)
world.plot(
ax=base, color='none', edgecolor='black', linewidth=0.5)
ax.set_axis_off()
The last step is to union HLL sets per continent and calculate the combined cardinality.
%%time
gdf_overlay = gp.overlay(
gdf, world,
how='intersection')
gdf_overlay.set_index("continent", inplace=True)
gdf_overlay.head()
%%time
cardinality_series = tools.union_hll_series(
gdf_overlay["user_hll"], db_conn=DB_CALC, cardinality=True)
Below we can see the combined distinct count of users on Flickr per continent.
cardinality_series.sort_values(ascending=False).head(10)
Assign the cardinality back to the continents GeoDataFrame:
continents.loc[
cardinality_series.index,
"usercount"] = cardinality_series.values
plot_kwd['legend_kwds'] = {'bbox_to_anchor': (1.4, 1), 'frameon':False}
ax = continents.plot(**plot_kwd)
ax.set_axis_off()
It is also possible to intersect HLL sets, to some degree. This can be used to estimate common visitor counts for, e.g., countries. The average error, however, will be larger, especially for intersections of HLL sets of different sizes.
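The underlying principle can be illustrated with exact Python sets, before applying it to HLL estimates below:
# inclusion-exclusion with exact sets: |A ∩ B| = |A| + |B| - |A ∪ B|
a = {1, 2, 3, 4}
b = {3, 4, 5}
assert len(a & b) == len(a) + len(b) - len(a | b)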
world = gp.read_file(
gp.datasets.get_path('naturalearth_lowres'),
crs=CRS_WGS)
world = world.to_crs(
CRS_PROJ)
Select the geometries for three countries. Two alternative selections are prepared; the assignments executed last define the active selection. Here, we use Germany, France and the United Kingdom:
geom1 = world[world['name'] == "Germany"]
geom2 = world[world['name'] == "France"]
geom3 = world[world['name'] == "United Kingdom"]
An alternative selection (Germany, USA, Canada) is kept for reference:
# geom1 = world[world['name'] == "Germany"]
# geom2 = world[world['name'] == "United States of America"]
# geom3 = world[world['name'] == "Canada"]
# ref_a = "de"
# ref_b = "us"
# ref_c = "ca"
# name_ref = {
#     ref_a: "Germany",
#     ref_b: "USA",
#     ref_c: "Canada"}
Define short reference labels for the active selection:
ref_a = "de"
ref_b = "fr"
ref_c = "uk"
name_ref = {
    ref_a: "Germany",
    ref_b: "France",
    ref_c: "United Kingdom"}
For visualization purposes, we want to filter some countries to their contiguous areas, e.g. the mainland for the US, or metropolitan France:
if ref_b == "us":
geom2 = geom2.explode(index_parts=False).iloc[0:1].dissolve(by='name')
if ref_b == "fr":
geom2 = geom2.explode(index_parts=False).iloc[1:].dissolve(by='name')
sel_colors=["#6d904f", "#008fd5", "#fc4f30"]
fig, ax = plt.subplots(1, 3)
for ix, subplot in enumerate([geom1, geom2, geom3]):
    subplot.plot(ax=ax[ix], color=sel_colors[ix])
    ax[ix].set_axis_off()
from geopandas.tools import sjoin
gdf_intersect = sjoin(
gdf, geom1, how='right')
hll_series = {}
Union all HLL sets for the selected geometry:
hll_series[ref_a] = union_hll(gdf_intersect["user_hll"])
This is the unioned HLL set for all unique users on Flickr who shared at least one geotagged photo from Germany.
hll_series[ref_a]
...an estimated number of:
cardinality_hll(hll_series[ref_a])
Vice versa, we can select the original coordinates that intersect with the selected geometries by using the index:
def get_selected_coords(gdf_intersect, gdf) -> gp.GeoDataFrame:
    """Return selected coordinates based on index"""
    gdf_intersect.set_index(
        ["index_left0", "index_left1"],
        inplace=True)
    return gdf.loc[gdf_intersect.index]
coords_ref_a = get_selected_coords(gdf_intersect, gdf)
Plot preview
minx, miny, maxx, maxy = gdf_intersect.total_bounds
fig, ax = plt.subplots(1, 1)
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
base = world.plot(
ax=ax, color='none', edgecolor='black', linewidth=0.5)
layers = coords_ref_a.plot(ax=base, markersize=5)
layers.set_axis_off()
Repeat for the other two countries
gdf_intersect = sjoin(
gdf, geom2, how='right')
hll_series[ref_b] = union_hll(gdf_intersect["user_hll"])
coords_ref_b = get_selected_coords(gdf_intersect, gdf)
gdf_intersect = sjoin(
gdf, geom3, how='right')
hll_series[ref_c] = union_hll(gdf_intersect["user_hll"])
coords_ref_c = get_selected_coords(gdf_intersect, gdf)
Print the distinct user counts, with the typical HLL error margin (±2.3%):
for ref, hll_set in hll_series.items():
    cardinality = cardinality_hll(hll_set)
    print(f'Estimated {cardinality} (±{int(cardinality/100*2.3)}) users for {name_ref.get(ref)}')
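The ±2.3% corresponds to the relative standard error of HyperLogLog, $1.04/\sqrt{m}$ with $m = 2^{log2m}$ registers; assuming log2m = 11 for these sets (our assumption), this gives roughly 2.3%:
import math

log2m = 11  # assumption: register exponent of the benchmark HLL sets
print(f"{1.04 / math.sqrt(2 ** log2m):.2%}")  # ≈ 2.30%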
Get the total bounds:
minx, miny, maxx, maxy = pd.concat(
[coords_ref_a["geometry"], coords_ref_b["geometry"], coords_ref_c["geometry"]]).total_bounds
from typing import List, Optional
def plot_map(
    gdf: gp.GeoDataFrame, sel_coords: List[gp.GeoDataFrame],
    sel_colors: List[str], buf = 100000,
    title: Optional[str] = None, ax = None):
    """Plot GeoDataFrame with matplotlib backend, with optional title"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5, 10))
    ax.set_xlim(minx-buf, maxx+buf)
    ax.set_ylim(miny-buf, maxy+buf)
    if title:
        ax.set_title(title, fontsize=12)
    for ix, sel_coord in enumerate(sel_coords):
        sel_coord.plot(
            ax=ax,
            color=sel_colors[ix],
            alpha=1.0,
            markersize=10)
    gdf.plot(
        ax=ax,
        alpha=0.2,
        markersize=0.1)
    # combine with world geometry
    world.plot(
        ax=ax, color='none', edgecolor='black', linewidth=0.3)
    # turn axis off
    ax.set_axis_off()
sel_coords=[coords_ref_a, coords_ref_b, coords_ref_c]
plot_map(
gdf=gdf, sel_coords=sel_coords,
sel_colors=sel_colors,
title=f'Grid selection for {ref_a.upper()}, {ref_b.upper()} and {ref_c.upper()}')
According to the union/intersection principle (inclusion-exclusion):
$|A \cup B| = |A| + |B| - |A \cap B|$
which can also be written as:
$|A \cap B| = |A| + |B| - |A \cup B|$
Therefore, unions can be used to calculate intersections. Calculate $|DE \cup FR|$, $|DE \cup UK|$ and $|FR \cup UK|$, i.e.:
IntersectionCount =
    cardinality_hll(de) +
    cardinality_hll(fr) -
    cardinality_hll(union_hll(de, fr))
distinct_users_total = {}
distinct_users_total[ref_a] = cardinality_hll(hll_series[ref_a])
distinct_users_total[ref_b] = cardinality_hll(hll_series[ref_b])
distinct_users_total[ref_c] = cardinality_hll(hll_series[ref_c])
distinct_users_total
union_p1 = (hll_series[ref_a], hll_series[ref_b])
union_p2 = (hll_series[ref_a], hll_series[ref_c])
union_p3 = (hll_series[ref_b], hll_series[ref_c])
First, prepare the combinations of the different sets.
hll_sel = {
f"{ref_a}-{ref_b}": union_p1,
f"{ref_a}-{ref_c}": union_p2,
f"{ref_b}-{ref_c}": union_p3
}
distinct_common = {}
for int_tuple, hll_sets in hll_sel.items():
    hll_set = union_hll(
        [hll_sets[0], hll_sets[1]])
    distinct_common[int_tuple] = cardinality_hll(hll_set)
    print(
        f"{distinct_common[int_tuple]} distinct total users "
        f"who shared Flickr photos from either {name_ref.get(int_tuple.split('-')[0])} "
        f"or {name_ref.get(int_tuple.split('-')[1])} (union)")
Calculate the intersections:
distinct_intersection = {}
for a, b in [(ref_a, ref_b), (ref_a, ref_c), (ref_b, ref_c)]:
    a_total = distinct_users_total.get(a)
    b_total = distinct_users_total.get(b)
    common_ref = f'{a}-{b}'
    intersection_count = a_total + b_total - distinct_common.get(common_ref)
    distinct_intersection[common_ref] = intersection_count
    print(
        f"{distinct_intersection[common_ref]} distinct users "
        f"who shared Flickr photos from {name_ref.get(a)} and {name_ref.get(b)} (intersection)")
Finally, let's get the number of users who have shared pictures from all three countries, based on the formula for three sets:
$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$
which can also be written as:
$|A \cap B \cap C| = |A \cup B \cup C| - |A| - |B| - |C| + |A \cap B| + |A \cap C| + |B \cap C|$
union_count_all = cardinality_hll(union_hll(
[hll_series[ref_a], hll_series[ref_b], hll_series[ref_c]]))
union_count_all
intersection_count_all = union_count_all - \
distinct_users_total[ref_a] - \
distinct_users_total[ref_b] - \
distinct_users_total[ref_c] + \
distinct_intersection[f'{ref_a}-{ref_b}'] + \
distinct_intersection[f'{ref_a}-{ref_c}'] + \
distinct_intersection[f'{ref_b}-{ref_c}']
print(intersection_count_all)
Since negative visitor counts are impossible, the increasing inaccuracies of nested intersections are easily observable here. In other words, there is simply too little overlap between the visitors of all three countries to be measurable with HLL intersections.
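Should the estimate drop below zero, the Venn diagram below could not render it; a minimal guard (our sketch, not part of the original workflow) is to clamp the value:
# clamp an impossible negative intersection estimate to zero
intersection_count_all = max(0, intersection_count_all)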
Since we're going to visualize this with matplotlib-venn, we need the following variables:
from matplotlib_venn import venn3, venn3_circles
plt.figure(figsize=(3,3))
v = venn3(
subsets=(
500,
500,
100,
500,
100,
100,
10),
set_labels = ('A', 'B', 'C'))
v.get_label_by_id('100').set_text('Abc')
v.get_label_by_id('010').set_text('aBc')
v.get_label_by_id('001').set_text('abC')
v.get_label_by_id('110').set_text('ABc')
v.get_label_by_id('101').set_text('AbC')
v.get_label_by_id('011').set_text('aBC')
v.get_label_by_id('111').set_text('ABC')
plt.show()
We already have ABC; the other values can be calculated:
ABC = intersection_count_all
ABc = distinct_intersection[f'{ref_a}-{ref_b}'] - ABC
aBC = distinct_intersection[f'{ref_b}-{ref_c}'] - ABC
AbC = distinct_intersection[f'{ref_a}-{ref_c}'] - ABC
# each country total splits into four disjoint regions,
# e.g. |A| = Abc + ABc + AbC + ABC, hence:
Abc = distinct_users_total[ref_a] - ABc - AbC - ABC
aBc = distinct_users_total[ref_b] - ABc - aBC - ABC
abC = distinct_users_total[ref_c] - aBC - AbC - ABC
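As a consistency check (our addition): the seven disjoint regions partition $|A \cup B \cup C|$ and should therefore sum back to the total union count, provided the intersection estimate was not clamped above:
# the seven disjoint Venn regions should sum to |A ∪ B ∪ C|
print(Abc + aBc + ABc + abC + AbC + aBC + ABC, union_count_all)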
Define a function to plot the Venn diagram.
from typing import Tuple
def plot_venn(
    subset_sizes: List[int],
    colors: List[str],
    names: List[str],
    subset_sizes_raw: Optional[List[int]] = None,
    ax = None,
    title: Optional[str] = None):
    """Plot Venn Diagram"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    set_labels = (
        'A', 'B', 'C')
    v = venn3(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        set_labels=set_labels,
        ax=ax)
    for ix, idx in enumerate(
            ['100', '010', '001']):
        v.get_patch_by_id(
            idx).set_color(colors[ix])
        v.get_patch_by_id(
            idx).set_alpha(0.8)
        v.get_label_by_id(
            set_labels[ix]).set_text(
                names[ix])
    if subset_sizes_raw:
        # 'None' keeps the enumerate index aligned with the
        # position of '001' (abC) in subset_sizes
        for ix, idx in enumerate(
                ['100', '010', None, '001']):
            if not idx:
                continue
            dif_abs = subset_sizes[ix] - subset_sizes_raw[ix]
            dif_perc = dif_abs / (subset_sizes_raw[ix] / 100)
            v.get_label_by_id(idx).set_text(
                f'{subset_sizes[ix]}\n{dif_perc:+.1f}%')
    label_ids = [
        '100', '010', '001',
        '110', '101', '011',
        '111', 'A', 'B', 'C']
    for label_id in label_ids:
        v.get_label_by_id(
            label_id).set_fontsize(14)
    # draw borders
    c = venn3_circles(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        linestyle='dashed',
        lw=1,
        ax=ax)
    if title:
        ax.title.set_text(title)
Plot Venn Diagram:
subset_sizes = [
Abc, aBc, ABc, abC, AbC, aBC, ABC]
names = [
name_ref.get(ref_a), name_ref.get(ref_b), name_ref.get(ref_c)]
plot_venn(
subset_sizes=subset_sizes,
colors=sel_colors,
names=names,
title="Common User Count")
A figure with two subplots (1 row, 2 columns).
fig, ax = plt.subplots(1, 2, figsize=(10, 15), width_ratios=[1, 1])
plot_map(
gdf=gdf, sel_coords=sel_coords,
sel_colors=sel_colors, ax=ax[0])
plot_venn(
subset_sizes=subset_sizes,
colors=sel_colors,
names=names,
ax=ax[1])
# store as png
fig.savefig(
Path.cwd().parents[0] / "resources" / f"hll_intersection_{ref_a}{ref_b}{ref_c}.png", dpi=300, format='PNG',
bbox_inches='tight', pad_inches=0)
Each dot on the left side is a coordinate shared in the original dataset, with an attached HLL set abstracting all users who shared photographs from this part of the world. By unioning all HLL sets for each of the three countries, based on coordinate-country intersection, we were able to calculate the intersection between the different sets. These estimated common visitor counts are labeled on the right side in the Venn diagram.
The number of distinct users who shared photos from the UK outweighs that of Germany and France. There is a slightly larger overlap between common visitors for France and the UK. By union of HLL sets, we are able to estimate distinct visitor counts for arbitrary areas or regions. This ability to re-use HLL sets is, however, limited by the 50 km resolution of the shared benchmark dataset.
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./intro.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-