Alexander Dunkel, Institute of Cartography, TU Dresden
This notebook demonstrates how to load and use the HyperLogLog benchmark data shared in
Dunkel, A., Hartmann, M. C., Hauthal, E., Burghardt, D., & Purves, R. S. (2023). From sunrise to sunset: Exploring landscape preference through global reactions to ephemeral events captured in georeferenced social media. PLoS ONE, 17(1).
The data is available in a data repository (the download URL is defined as source_zip below).
To run this notebook, as a starting point, you have two options:
Clone the repository and edit your .env value to point to the directory where this notebook can be found, e.g.:
git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/sunsetsunrise-demo
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
We also use the pg-hll-empty PostgreSQL database as a backend to work with the HyperLogLog data.
Alternatively, you can use the Python package python-hll (which is included in the Carto-Lab Docker container),
but it will be slower than the native postgresql-hll implementation. To pull pg-hll-empty
and add it to the same Docker network:
git clone https://gitlab.vgiscience.de/lbsn/databases/pg-hll-empty.git
cd pg-hll-empty
cp vars.env.example vars.env
nano vars.env
# optionally adjust READONLY_USER_PASSWORD
docker-compose up -d
Load dependencies:
import os
import sys
from pathlib import Path
import psycopg2 # Postgres API
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from pyproj import Transformer, CRS, Proj
from IPython.display import clear_output, display, HTML

# make the local helper module (tools) importable;
# adjust the path below if your repository layout differs
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Define initial parameters that affect processing:
WORK_DIR = Path.cwd().parents[0] / "tmp"
"""Working diretcory"""
EPSG_CODE = 54009
CRS_PROJ = f"esri:{EPSG_CODE}"
"""Target projection: Mollweide (epsg code).
note: Mollweide defined by _esri_
in epsg.io's database"""
CRS_WGS = "epsg:4326"
"""Input projection (Web Mercator)"""
OUTPUT = Path.cwd().parents[0] / "out"
"""Define path to output directory (figures etc.)""";
WORK_DIR.mkdir(exist_ok=True)
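As a quick optional check (our addition, not part of the original workflow), pyproj should resolve the ESRI-authority Mollweide code:
# confirm that the ESRI-authority code resolves in pyproj
print(CRS(CRS_PROJ).name)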
source_zip="https://opara.zih.tu-dresden.de/xmlui/bitstream/handle/123456789/5793/S10.zip?sequence=1&isAllowed=y"
if not (WORK_DIR / "flickr_all_hll.csv").exists():
    tools.get_zip_extract(uri=source_zip, filename="S10.zip", output_path=WORK_DIR)
Show the directory tree:
tools.tree(Path.cwd().parents[0])
ALL_FLICKR = WORK_DIR / "flickr_all_hll.csv"
dtypes = {'latitude': float, 'longitude': float}
df = pd.read_csv(ALL_FLICKR, dtype=dtypes, encoding='utf-8')
df.head()
First, let's test the slower Python implementation python-hll, to get the cardinality for some of the original HLL strings. Cardinality means the estimated count of distinct items in a set, which refers to the user count in our case.
from python_hll.util import NumberUtil
from python_hll.hll import HLL
def hll_from_byte(hll_set: str) -> HLL:
    """Return HLL set from its hex string representation"""
    # strip the leading '\x' marker of the Postgres hex format
    hex_string = hll_set[2:]
    return HLL.from_bytes(
        NumberUtil.from_hex(
            hex_string, 0, len(hex_string)))
Have a look at the first HLL set:
df["user_hll"][0]
Cast to HLL and calculate cardinality in one step:
hll_from_byte(df["user_hll"][0]).cardinality() - 1
An estimated 12 distinct users have been observed at location 0.040050, -179.750586 (lat, lng).
These latitude and longitude coordinates refer to the centroids of the original 50 km grid.
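For orientation, a quick look at the size and extent of the grid (our addition):
# number of grid centroids and their coordinate ranges
print(f"{len(df)} grid centroids")
print(df[["latitude", "longitude"]].agg(["min", "max"]))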
In order to speed up processing, we can connect to a Postgres database running the faster postgresql-hll extension from Citus.
Password and username for connecting to the local hllworker database are loaded from the environment.
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "hllworkerdb"
DB_PORT = "5432"
DB_NAME = "hllworkerdb"
Connect to the empty Postgres database running the HLL extension:
DB_CONN = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER,
password=DB_PASS
)
DB_CONN.set_session(
readonly=True)
DB_CALC = tools.DbConn(
DB_CONN)
CUR_HLL = DB_CONN.cursor()
Test connection:
CUR_HLL.execute("SELECT 1;")
print(CUR_HLL.statusmessage)
Test with an actual HLL set calculation:
db_query = f"""
SELECT lat, lng, hll_cardinality(user_hll)::int as usercount
FROM (VALUES
('{df["latitude"][0]}', '{df["longitude"][0]}', '{df["user_hll"][0]}'::hll)
) s(lat, lng, user_hll)
"""
df2 = DB_CALC.query(db_query)
df2.head()
Store this as two functions, to be used later.
def union_hll(hll_sets: List[str], cur_hll = CUR_HLL) -> str:
    "Union two or more HLL sets and return the combined HLL set"
    hll_values = ", ".join(f"('{hll_set}'::hll)" for hll_set in hll_sets)
    db_query = f"""
        SELECT hll_union_agg(s.hll_set)
        FROM (
            VALUES
                {hll_values}
        ) s(hll_set);
        """
    # use the cursor passed as parameter (defaults to CUR_HLL)
    cur_hll.execute(db_query)
    # return the first column of the single result row
    return cur_hll.fetchone()[0]
def cardinality_hll(hll_set: str, cur_hll = CUR_HLL) -> int:
    "Calculate the cardinality of an HLL set and return as int"
    cur_hll.execute(f"SELECT hll_cardinality('{hll_set}')::int;")
    return int(cur_hll.fetchone()[0])
results = union_hll([df["user_hll"][0], df["user_hll"][1]])
results
cardinality_hll(results)
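As a plausibility check (our addition): the cardinality of a union should lie between the larger of the two individual counts and their sum.
# bounds check: max(|A|, |B|) <= |A ∪ B| <= |A| + |B| (approximately, for estimates)
card_a = cardinality_hll(df["user_hll"][0])
card_b = cardinality_hll(df["user_hll"][1])
print(card_a, card_b, cardinality_hll(results))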
Now that we have verified individual HLL sets, we will aggregate user counts per continent. Load the Natural Earth country geometries:
world = gp.read_file(
gp.datasets.get_path('naturalearth_lowres'),
crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)
world = world[['continent', 'geometry']]
ax = world.plot(column='continent')
ax.set_axis_off()
There are slight inaccuracies in the continent geometries (overlapping polygons), which can be fixed with a small buffer before dissolving country geometries:
world['geometry'] = world.buffer(0.01)
continents = world.dissolve(by='continent')
continents.head()
Assign the index back as a column, to allow classification by color via column='continent':
continents['continent'] = continents.index.get_level_values(0)
ax = continents.plot(column='continent')
ax.set_axis_off()
gdf = gp.GeoDataFrame(
df, geometry=gp.points_from_xy(df.longitude, df.latitude))
gdf.crs=CRS_WGS
gdf = gdf.to_crs(CRS_PROJ)
Define the pyproj Transformer ahead of time, with xy-order of coordinates, for future projections between WGS 1984 and ESRI:54009 (Mollweide):
PROJ_TRANSFORMER = Transformer.from_crs(
CRS_WGS, CRS_PROJ, always_xy=True)
# also define reverse projection
PROJ_TRANSFORMER_BACK = Transformer.from_crs(
CRS_PROJ, CRS_WGS, always_xy=True)
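For illustration (our addition), project a single centroid to Mollweide and back:
# x/y order: longitude first, latitude second (always_xy=True)
x, y = PROJ_TRANSFORMER.transform(df["longitude"][0], df["latitude"][0])
print((x, y), PROJ_TRANSFORMER_BACK.transform(x, y))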
gdf.set_index(['latitude', 'longitude'], inplace=True)
gdf.head()
Define a preview area (Italy)
bbox_eu = (
6.8662109375, 35.24427318493909,
22.31396484375, 45.29320031385282)
# create bounds from WGS1984 italy and project to Mollweide
minx, miny = PROJ_TRANSFORMER.transform(
bbox_eu[0], bbox_eu[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
bbox_eu[2], bbox_eu[3])
gdf_italy = gdf.cx[minx:maxx, miny:maxy]
%%time
cardinality_series = tools.hll_series_cardinality(gdf_italy["user_hll"], db_conn=DB_CALC)
cardinality_series.head()
gdf.loc[
cardinality_series.index,
"usercount"] = cardinality_series
Preview HLL coordinates and cardinality:
fig, ax = plt.subplots(1, 1)
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
plot_kwd = {
'column':'usercount',
'legend':True,
'markersize':25,
'cmap':'Reds',
'scheme':'quantiles',
'legend_kwds':{'bbox_to_anchor': (1, 1), 'frameon':False, 'markerscale':0.7}
}
base = gdf.cx[minx:maxx, miny:maxy].plot(
ax=ax, **plot_kwd)
world.plot(
ax=base, color='none', edgecolor='black', linewidth=0.5)
ax.set_axis_off()
The last step is to union HLL sets per continent and calculate the combined cardinality.
%%time
gdf_overlay = gp.overlay(
gdf, world,
how='intersection')
gdf_overlay.set_index("continent", inplace=True)
gdf_overlay.head()
%%time
cardinality_series = tools.union_hll_series(
gdf_overlay["user_hll"], db_conn=DB_CALC, cardinality=True)
Below we can see the combined distinct count of users on Flickr per continent.
cardinality_series.sort_values(ascending=False).head(10)
Assign the cardinality back to the continents GeoDataFrame:
continents.loc[
cardinality_series.index,
"usercount"] = cardinality_series.values
plot_kwd['legend_kwds'] = {'bbox_to_anchor': (1.4, 1), 'frameon':False}
ax = continents.plot(**plot_kwd)
ax.set_axis_off()
It is also possible to intersect HLL sets, to some degree. This can be used to estimate common visitor counts for, e.g., countries. The average error, however, will be larger, especially for intersections of HLL sets of different sizes.
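The underlying principle can be illustrated with exact Python sets, before applying it to HLL estimates below:
# inclusion-exclusion with exact sets: |A ∩ B| = |A| + |B| - |A ∪ B|
a = {1, 2, 3, 4}
b = {3, 4, 5}
assert len(a & b) == len(a) + len(b) - len(a | b)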
world = gp.read_file(
gp.datasets.get_path('naturalearth_lowres'),
crs=CRS_WGS)
world = world.to_crs(
CRS_PROJ)
Select the geometries for three countries. Two alternative selections are prepared; the assignments executed last define the active selection. Here, we use Germany, France and the United Kingdom:
geom1 = world[world['name'] == "Germany"]
geom2 = world[world['name'] == "France"]
geom3 = world[world['name'] == "United Kingdom"]
An alternative selection (Germany, USA, Canada) is kept for reference:
# geom1 = world[world['name'] == "Germany"]
# geom2 = world[world['name'] == "United States of America"]
# geom3 = world[world['name'] == "Canada"]
# ref_a = "de"
# ref_b = "us"
# ref_c = "ca"
# name_ref = {
#     ref_a: "Germany",
#     ref_b: "USA",
#     ref_c: "Canada"}
Define short reference labels for the active selection:
ref_a = "de"
ref_b = "fr"
ref_c = "uk"
name_ref = {
    ref_a: "Germany",
    ref_b: "France",
    ref_c: "United Kingdom"}
For visualization purposes, we want to filter some countries to their contiguous areas, e.g. the mainland for the US, or metropolitan France:
if ref_b == "us":
geom2 = geom2.explode(index_parts=False).iloc[0:1].dissolve(by='name')
if ref_b == "fr":
geom2 = geom2.explode(index_parts=False).iloc[1:].dissolve(by='name')
sel_colors=["#6d904f", "#008fd5", "#fc4f30"]
fig, ax = plt.subplots(1, 3)
for ix, subplot in enumerate([geom1, geom2, geom3]):
    subplot.plot(ax=ax[ix], color=sel_colors[ix])
    ax[ix].set_axis_off()
from geopandas.tools import sjoin
gdf_intersect = sjoin(
gdf, geom1, how='right')
hll_series = {}
Union all HLL sets for the selected geometry:
hll_series[ref_a] = union_hll(gdf_intersect["user_hll"])
This is the unioned HLL set for all unique users on Flickr who shared at least one geotagged photo from Germany.
hll_series[ref_a]
...an estimated number of:
cardinality_hll(hll_series[ref_a])
Vice versa, we can select the original coordinates that intersect with the selected geometries by using the index:
def get_selected_coords(gdf_intersect, gdf) -> gp.GeoDataFrame:
    """Return selected coordinates based on index"""
    gdf_intersect.set_index(
        ["index_left0", "index_left1"],
        inplace=True)
    return gdf.loc[gdf_intersect.index]
coords_ref_a = get_selected_coords(gdf_intersect, gdf)
Plot preview
minx, miny, maxx, maxy = gdf_intersect.total_bounds
fig, ax = plt.subplots(1, 1)
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
base = world.plot(
ax=ax, color='none', edgecolor='black', linewidth=0.5)
layers = coords_ref_a.plot(ax=base, markersize=5)
layers.set_axis_off()
Repeat for the other two countries
gdf_intersect = sjoin(
gdf, geom2, how='right')
hll_series[ref_b] = union_hll(gdf_intersect["user_hll"])
coords_ref_b = get_selected_coords(gdf_intersect, gdf)
gdf_intersect = sjoin(
gdf, geom3, how='right')
hll_series[ref_c] = union_hll(gdf_intersect["user_hll"])
coords_ref_c = get_selected_coords(gdf_intersect, gdf)
Print the distinct user counts, with the typical HLL error margin (±2.3%):
for ref, hll_set in hll_series.items():
    cardinality = cardinality_hll(hll_set)
    print(f'Estimated {cardinality} (±{int(cardinality/100*2.3)}) users for {name_ref.get(ref)}')
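The ±2.3% corresponds to the relative standard error of HyperLogLog, $1.04/\sqrt{m}$ with $m = 2^{log2m}$ registers; assuming log2m = 11 for these sets (our assumption), this gives roughly 2.3%:
import math

log2m = 11  # assumption: register exponent of the benchmark HLL sets
print(f"{1.04 / math.sqrt(2 ** log2m):.2%}")  # ≈ 2.30%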
Get the total bounds:
minx, miny, maxx, maxy = pd.concat(
[coords_ref_a["geometry"], coords_ref_b["geometry"], coords_ref_c["geometry"]]).total_bounds
from typing import List, Optional
def plot_map(
    gdf: gp.GeoDataFrame, sel_coords: List[gp.GeoDataFrame],
    sel_colors: List[str], buf = 100000,
    title: Optional[str] = None, ax = None):
    """Plot GeoDataFrame with matplotlib backend, with optional title"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5, 10))
    ax.set_xlim(minx-buf, maxx+buf)
    ax.set_ylim(miny-buf, maxy+buf)
    if title:
        ax.set_title(title, fontsize=12)
    for ix, sel_coord in enumerate(sel_coords):
        sel_coord.plot(
            ax=ax,
            color=sel_colors[ix],
            alpha=1.0,
            markersize=10)
    gdf.plot(
        ax=ax,
        alpha=0.2,
        markersize=0.1)
    # combine with world geometry
    world.plot(
        ax=ax, color='none', edgecolor='black', linewidth=0.3)
    # turn axis off
    ax.set_axis_off()
sel_coords=[coords_ref_a, coords_ref_b, coords_ref_c]
plot_map(
gdf=gdf, sel_coords=sel_coords,
sel_colors=sel_colors,
title=f'Grid selection for {ref_a.upper()}, {ref_b.upper()} and {ref_c.upper()}')
According to the union/intersection principle (inclusion-exclusion):
$|A \cup B| = |A| + |B| - |A \cap B|$
which can also be written as:
$|A \cap B| = |A| + |B| - |A \cup B|$
Therefore, unions can be used to calculate intersections. Calculate $|DE \cup FR|$, $|DE \cup UK|$ and $|FR \cup UK|$, i.e.:
IntersectionCount =
    cardinality_hll(de) +
    cardinality_hll(fr) -
    cardinality_hll(union_hll(de, fr))
distinct_users_total = {}
distinct_users_total[ref_a] = cardinality_hll(hll_series[ref_a])
distinct_users_total[ref_b] = cardinality_hll(hll_series[ref_b])
distinct_users_total[ref_c] = cardinality_hll(hll_series[ref_c])
distinct_users_total
union_p1 = (hll_series[ref_a], hll_series[ref_b])
union_p2 = (hll_series[ref_a], hll_series[ref_c])
union_p3 = (hll_series[ref_b], hll_series[ref_c])
First, prepare the combinations of the different sets.
hll_sel = {
f"{ref_a}-{ref_b}": union_p1,
f"{ref_a}-{ref_c}": union_p2,
f"{ref_b}-{ref_c}": union_p3
}
distinct_common = {}
for int_tuple, hll_sets in hll_sel.items():
    hll_set = union_hll(
        [hll_sets[0], hll_sets[1]])
    distinct_common[int_tuple] = cardinality_hll(hll_set)
    print(
        f"{distinct_common[int_tuple]} distinct total users "
        f"who shared Flickr photos from either {name_ref.get(int_tuple.split('-')[0])} "
        f"or {name_ref.get(int_tuple.split('-')[1])} (union)")
Calculate the intersections:
distinct_intersection = {}
for a, b in [(ref_a, ref_b), (ref_a, ref_c), (ref_b, ref_c)]:
    a_total = distinct_users_total.get(a)
    b_total = distinct_users_total.get(b)
    common_ref = f'{a}-{b}'
    intersection_count = a_total + b_total - distinct_common.get(common_ref)
    distinct_intersection[common_ref] = intersection_count
    print(
        f"{distinct_intersection[common_ref]} distinct users "
        f"who shared Flickr photos from {name_ref.get(a)} and {name_ref.get(b)} (intersection)")
Finally, let's get the number of users who have shared pictures from all three countries, based on the formula for three sets:
$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$
which can also be written as:
$|A \cap B \cap C| = |A \cup B \cup C| - |A| - |B| - |C| + |A \cap B| + |A \cap C| + |B \cap C|$
union_count_all = cardinality_hll(union_hll(
[hll_series[ref_a], hll_series[ref_b], hll_series[ref_c]]))
union_count_all
intersection_count_all = union_count_all - \
distinct_users_total[ref_a] - \
distinct_users_total[ref_b] - \
distinct_users_total[ref_c] + \
distinct_intersection[f'{ref_a}-{ref_b}'] + \
distinct_intersection[f'{ref_a}-{ref_c}'] + \
distinct_intersection[f'{ref_b}-{ref_c}']
print(intersection_count_all)
Since negative visitor counts are impossible, the increasing inaccuracies of nested intersections are easily observable here. In other words, there is simply too little overlap between the visitors of all three countries to be measurable with HLL intersections.
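Should the estimate drop below zero, the Venn diagram below could not render it; a minimal guard (our sketch, not part of the original workflow) is to clamp the value:
# clamp an impossible negative intersection estimate to zero
intersection_count_all = max(0, intersection_count_all)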
Since we're going to visualize this with matplotlib-venn, we need the following variables:
from matplotlib_venn import venn3, venn3_circles
plt.figure(figsize=(3,3))
v = venn3(
subsets=(
500,
500,
100,
500,
100,
100,
10),
set_labels = ('A', 'B', 'C'))
v.get_label_by_id('100').set_text('Abc')
v.get_label_by_id('010').set_text('aBc')
v.get_label_by_id('001').set_text('abC')
v.get_label_by_id('110').set_text('ABc')
v.get_label_by_id('101').set_text('AbC')
v.get_label_by_id('011').set_text('aBC')
v.get_label_by_id('111').set_text('ABC')
plt.show()
We already have ABC; the other values can be calculated:
ABC = intersection_count_all
ABc = distinct_intersection[f'{ref_a}-{ref_b}'] - ABC
aBC = distinct_intersection[f'{ref_b}-{ref_c}'] - ABC
AbC = distinct_intersection[f'{ref_a}-{ref_c}'] - ABC
# each country total splits into four disjoint regions,
# e.g. |A| = Abc + ABc + AbC + ABC, hence:
Abc = distinct_users_total[ref_a] - ABc - AbC - ABC
aBc = distinct_users_total[ref_b] - ABc - aBC - ABC
abC = distinct_users_total[ref_c] - aBC - AbC - ABC
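As a consistency check (our addition): the seven disjoint regions partition $|A \cup B \cup C|$ and should therefore sum back to the total union count, provided the intersection estimate was not clamped above:
# the seven disjoint Venn regions should sum to |A ∪ B ∪ C|
print(Abc + aBc + ABc + abC + AbC + aBC + ABC, union_count_all)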
Define a function to plot the Venn diagram.
from typing import Tuple
def plot_venn(
    subset_sizes: List[int],
    colors: List[str],
    names: List[str],
    subset_sizes_raw: Optional[List[int]] = None,
    ax = None,
    title: Optional[str] = None):
    """Plot Venn Diagram"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    set_labels = (
        'A', 'B', 'C')
    v = venn3(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        set_labels=set_labels,
        ax=ax)
    for ix, idx in enumerate(
            ['100', '010', '001']):
        v.get_patch_by_id(
            idx).set_color(colors[ix])
        v.get_patch_by_id(
            idx).set_alpha(0.8)
        v.get_label_by_id(
            set_labels[ix]).set_text(
                names[ix])
    if subset_sizes_raw:
        # 'None' keeps the enumerate index aligned with the
        # position of '001' (abC) in subset_sizes
        for ix, idx in enumerate(
                ['100', '010', None, '001']):
            if not idx:
                continue
            dif_abs = subset_sizes[ix] - subset_sizes_raw[ix]
            dif_perc = dif_abs / (subset_sizes_raw[ix] / 100)
            v.get_label_by_id(idx).set_text(
                f'{subset_sizes[ix]}\n{dif_perc:+.1f}%')
    label_ids = [
        '100', '010', '001',
        '110', '101', '011',
        '111', 'A', 'B', 'C']
    for label_id in label_ids:
        v.get_label_by_id(
            label_id).set_fontsize(14)
    # draw borders
    c = venn3_circles(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        linestyle='dashed',
        lw=1,
        ax=ax)
    if title:
        ax.title.set_text(title)
Plot Venn Diagram:
subset_sizes = [
Abc, aBc, ABc, abC, AbC, aBC, ABC]
names = [
name_ref.get(ref_a), name_ref.get(ref_b), name_ref.get(ref_c)]
plot_venn(
subset_sizes=subset_sizes,
colors=sel_colors,
names=names,
title="Common User Count")
A figure with two subplots (1 row, 2 columns).
fig, ax = plt.subplots(1, 2, figsize=(10, 15), width_ratios=[1, 1])
plot_map(
gdf=gdf, sel_coords=sel_coords,
sel_colors=sel_colors, ax=ax[0])
plot_venn(
subset_sizes=subset_sizes,
colors=sel_colors,
names=names,
ax=ax[1])
# store as png
fig.savefig(
Path.cwd().parents[0] / "resources" / f"hll_intersection_{ref_a}{ref_b}{ref_c}.png", dpi=300, format='PNG',
bbox_inches='tight', pad_inches=0)
Each dot on the left side is a coordinate shared in the original dataset, with an attached HLL set abstracting all users who shared photographs from this part of the world. By unioning all HLL sets for each of the three countries, based on coordinate-country intersection, we were able to calculate the intersection between the different sets. These estimated common visitor counts are labeled on the right side in the Venn diagram.
The number of distinct users who shared photos from the UK outweighs that of Germany and France. There is a slightly larger overlap between common visitors for France and the UK. By union of HLL sets, we are able to estimate distinct visitor counts for arbitrary areas or regions. This ability to re-use HLL sets is, however, limited by the 50 km resolution of the shared benchmark dataset.
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./intro.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-