Preparation: Raw → HLL Conversion

Alexander Dunkel, Institute of Cartography, TU Dresden

Out[4]:

Last updated: Aug-18-2023, Carto-Lab Docker Version 0.14.0

Abstract: preparation for the visualization of temporal patterns for ephemeral events.

Prepare environment

To run this notebook, as a starting point, you have two options:

1. Create an environment with the packages and versions shown in the following cell.

As a starting point, you may use the latest conda environment_default.yml from our Carto-Lab Docker container.

2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.

Clone the repository and edit your .env value to point to the repository where this notebook can be found, e.g.:

git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
List of package versions used in this notebook
package           version
python            3.9.15
Fiona             1.8.20
Shapely           1.7.1
bokeh             2.4.3
colorcet          3.0.1
geopandas         0.13.2
geoviews          1.9.5
holoviews         1.14.8
hvplot            0.8.4
ipywidgets        8.0.7
mapclassify       2.5.0
matplotlib        3.7.1
matplotlib-venn   0.11.9
numpy             1.22.4
pandas            2.0.3
python-dotenv     1.0.0
xarray            2023.6.0

Load dependencies:

In [6]:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML

To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load helper module from ../py/modules/base/tools.py.

In [7]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
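
The tools.DbConn class used throughout this notebook wraps a DB-API connection so that queries return pandas DataFrames. A minimal sketch of such a wrapper, for orientation only (the actual implementation in py/modules/base/tools.py may differ in detail):

```python
import pandas as pd

class DbConn:
    """Thin wrapper around a DB-API connection returning DataFrames.

    Minimal sketch; the actual implementation lives in
    py/modules/base/tools.py and may differ in detail.
    """

    def __init__(self, db_connection):
        self.db_connection = db_connection

    def query(self, sql_query: str) -> pd.DataFrame:
        """Execute a SELECT query and return the result as a DataFrame."""
        return pd.read_sql(sql_query, self.db_connection)
```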

Activate autoreload of changed python files:

In [8]:
%load_ext autoreload
%autoreload 2

Parameters

Define initial parameters that affect processing

In [9]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory                     
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
In [10]:
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)

Load environment variables from the .env dotfile:

In [11]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[11]:
True
In [12]:
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB")    # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB")    # lbsn-rawdb host

Raw to HLL conversion

In [13]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "lbsn-hlldb"
db_port = "5432"
db_name = "hlldb"
In [14]:
db_connection_hll = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
SELECT 1

Simplify query access:

In [15]:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Out[15]:
?column?
0 1
If any SQL results in an error, the cursor cannot be used again. In this case, run db_connection_hll.rollback() once to reset the cursor:
db_connection_hll.rollback()
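
This pattern can be folded into a small helper that rolls back automatically whenever a query fails, so the connection stays usable. A sketch only; the helper name run_query is hypothetical and not part of tools.py:

```python
def run_query(db_connection, db_conn, sql: str):
    """Run a query; on failure, roll back so the connection stays usable.

    Sketch of the rollback pattern described above. `db_connection` is the
    underlying psycopg2 connection, `db_conn` the query wrapper; the helper
    name is hypothetical.
    """
    try:
        return db_conn.query(sql)
    except Exception:
        # reset the aborted transaction, then re-raise for visibility
        db_connection.rollback()
        raise
```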

Create Query Schema

Create a new schema called mviews and update Postgres search_path, to include new schema:

In [13]:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
                  social,
                  spatial,
                  temporal,
                  topical,
                  interlinkage,
                  extensions,
                  mviews;"""

Since the above query will not return any result, we'll directly use the psycopg2 cursor object:

In [14]:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
ALTER DATABASE

Using a foreign table, this step establishes the connection from hlldb to rawdb.

On hlldb, install postgres_fdw extension:

In [14]:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE EXTENSION

Check if foreign table has been imported already:

In [15]:
result = tools.check_table_exists(db_conn_hll, 'flickr_all_reduced')
print(result)
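
A helper like check_table_exists can be implemented with a query against the Postgres information_schema. A sketch, assuming the wrapper's query() returns a DataFrame; the actual version in py/modules/base/tools.py may be implemented differently:

```python
import pandas as pd

def check_table_exists(db_conn, table_name: str, schema: str = "mviews") -> bool:
    """Return True if a table is visible in the given schema.

    Sketch using the Postgres information_schema; the real helper in
    py/modules/base/tools.py may differ.
    """
    sql = f"""
        SELECT EXISTS (
            SELECT 1 FROM information_schema.tables
            WHERE table_schema = '{schema}'
              AND table_name = '{table_name}'
        ) AS table_exists;
        """
    df = db_conn.query(sql)
    return bool(df["table_exists"].iloc[0])
```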

Conditionally load the password. This only needs to be done once, if the server hasn't been added before.

In [17]:
if not result:
    import getpass
    USER_KEY = getpass.getpass()

Create Foreign Server connection to rawdb, on hlldb:

In [35]:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw 
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
    host '{DB_HOST_RAWDB}',
    dbname '{DB_NAME_RAWDB}',
    port '5432',
    keepalives '1',
    keepalives_idle '30',
    keepalives_interval '10',
    keepalives_count '5',
    fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
    SERVER lbsnraw 
    OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
CREATE USER MAPPING

Import foreign table definition on the hlldb.

In [36]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        flickr_all_reduced)
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA

Test:

In [52]:
db_conn.query("SELECT * FROM mviews.flickr_all_reduced LIMIT 10;")
Out[52]:
post_guid user_guid month year
0 DfzHh1byGZK9+3yqfLYKtoBFlvur7r1/WnlZk0X2pKo 0hgydU2DQr8mjkg3czy5TBw04neL6LgarKCOE31WFwI 7.0 2013.0
1 4Bq09lrS7oWuSe+iEVBa2KWWGmN08VOrNW7yRWJlm6A iurjphR53OAR9i1JUJW4nI3Ih5X9FLTLwdDxL/gVUFM 11.0 2014.0
2 Ni+GRlq6eByl1Q4BTNQA3pFs/1EhqThuJnE/qANsZrw tgPi7WsqslyC5+mBNCO1UB3w5QFop+eC3JeONdPudsA 1.0 2020.0
3 jWEQXNohR20XxR3peDFTwY7iTmgZuBgxM1FCt84MBx4 4zGAbVo5Vx7E3mSRslaC2J0MQqgvPp2nGbHOXPe6+kc 8.0 2015.0
4 aPTDt3gFRwLbNAb3YnrxVjhq1GhQrKrexOAZCGOCEJI 4zGAbVo5Vx7E3mSRslaC2J0MQqgvPp2nGbHOXPe6+kc 8.0 2015.0
5 yNCgTnE6/C8Er8lS+aIGZumgI2/9V3VXX1PPuLvuqHk CgwD+K4awz1S6YQBrejkUzxypvcWQgX1Uce0HGvy88U 1.0 2011.0
6 dahQo84wP0pOcGBOu7RJrrxGnceYlnL+BoTH2rNAxkY tgPi7WsqslyC5+mBNCO1UB3w5QFop+eC3JeONdPudsA 1.0 2020.0
7 8X0TI0AfEDt8+kdpqV5na528ktMWpNsIFPJj6KgvUKA yRd3gtZ0NzSobTc85wAyCfa7Dzm/92GalR40Fuw6yEM 1.0 2009.0
8 x+onFV4uS6ronrcL0Q68PeP6A6hf+X5bL1h06WkjEjE tgPi7WsqslyC5+mBNCO1UB3w5QFop+eC3JeONdPudsA 1.0 2020.0
9 O4Xuh2YrH6Y6a+mO9YjjdRGD7+NuR/Hy8CybDWoc2Es tgPi7WsqslyC5+mBNCO1UB3w5QFop+eC3JeONdPudsA 1.0 2020.0

Commit changes to hlldb

In [38]:
db_connection_hll.commit()

Prepare conversion of raw data to hll

HyperLogLog parameters

The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance that can affect the sensitivity of the data.

From a privacy perspective, for example, it is recommended to disable explicit mode.

Explicit mode? When explicit mode is active, full IDs are stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.

The parameters to hll_set_defaults() are log2m, regwidth, expthresh and sparseon; setting expthresh to 0 disables explicit mode, and the call returns the previous defaults (where -1 means automatic expthresh selection).

To disable explicit mode:

In [26]:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[26]:
hll_set_defaults
0 (11,5,-1,1)

Aggregation step: Convert data to Hll

In [61]:
def materialized_view_hll(
        table_name_src: str, table_name_dest: str, schema: str = None,
        additional_cols: Optional[List[str]] = None) -> str:
    """Returns raw SQL for creating a materialized view with HLL aggregates"""
    if schema is None:
        schema = 'mviews'
    # optional additional columns (e.g. origin_id); guard against None
    add_cols = ''.join([f",{x}" for x in additional_cols]) if additional_cols else ''
    return f"""
        CREATE MATERIALIZED VIEW {schema}.{table_name_dest} AS
            SELECT 
                month,
                year,
                hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
                hll_add_agg((hll_hash_text(user_guid))) AS "user_hll"
                {add_cols}
            FROM {schema}.{table_name_src}
            GROUP BY year,month{add_cols}
            ORDER BY 
            year ASC,
            month ASC;
        """
In [54]:
%%time
sql_query = materialized_view_hll(
    table_name_src="flickr_all_reduced", table_name_dest="flickr_all_months")
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 207

Test:

In [55]:
db_conn.query("SELECT * FROM mviews.flickr_all_months LIMIT 10;")
Out[55]:
month year post_hll user_hll
0 1.0 1970.0 \x138b4006e432653ea153e15a835e4168e279e1ad21dd... \x138b40da21
1 3.0 2005.0 \x138b40308549657da28201ee41f1e1 \x138b40e1e1
2 6.0 2005.0 \x138b4046c1 \x138b4025a1
3 7.0 2005.0 \x138b406ea3baa1 \x138b4000041ea2
4 8.0 2005.0 \x138b4027625181b9e1 \x138b4039a1cd23f101
5 9.0 2005.0 \x138b4027a23564506157638482ae43b8a2 \x138b4000041722a9e3
6 10.0 2005.0 \x138b4013a11921206130e53101372143c4538158816d... \x138b4017229943ac61b541ba62e541e663
7 11.0 2005.0 \x138b4002c21ae32501330236c1474347c24c624f6363... \x138b402d813de145e28621b841
8 12.0 2005.0 \x138b4002c103a1054306a207010bc60ee111c2168217... \x138b40000406e12d813de15ee1a862cda4e541
9 1.0 2006.0 \x138b400041018105c307010ec31ac12b612cc22dc13b... \x138b402b212d813a01b841cbe1d624e541ea01f945fcc2
In [56]:
db_connection_hll.commit()

Export data as CSV

Save the HLL data to CSV. The following columns are available from the table mviews.flickr_all_months:

  • year - distinct year
  • month - distinct month
  • post_hll - approximate post guids, stored as HLL set
  • user_hll - approximate user guids, stored as HLL set
In [71]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll
    FROM mviews.flickr_all_months;
    """
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].astype(int)
In [72]:
df.head()
Out[72]:
year month post_hll user_hll
0 1970 1 \x138b4006e432653ea153e15a835e4168e279e1ad21dd... \x138b40da21
1 2005 3 \x138b40308549657da28201ee41f1e1 \x138b40e1e1
2 2005 6 \x138b4046c1 \x138b4025a1
3 2005 7 \x138b406ea3baa1 \x138b4000041ea2
4 2005 8 \x138b4027625181b9e1 \x138b4039a1cd23f101
In [73]:
usecols = ["year", "month", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / "flickr_all_months.csv",
    mode='w', columns=usecols,
    index=False, header=True)
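
Downstream notebooks can read the exported CSV back with explicit dtypes, so that year and month stay integers and the HLL sets stay strings. A minimal sketch using toy data in place of the real file:

```python
import io
import pandas as pd

# Toy CSV mirroring the structure of flickr_all_months.csv (not real data)
csv_text = io.StringIO(
    "year,month,post_hll,user_hll\n"
    "2005,3,\\x138b4030,\\x138b40e1\n"
    "2005,6,\\x138b4046,\\x138b4025\n"
)
df = pd.read_csv(
    csv_text,
    dtype={"year": int, "month": int, "post_hll": str, "user_hll": str})
```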

Import Milvus milvus and Bloom examples

Two thematic queries have been collected to classify subsets of Flickr photos: Milvus milvus related imagery, based on the terms 'red_kite', 'redkite', 'milvusmilvus', 'milvus_milvus' and 'milvus', and bloom related imagery, based on the terms 'bloom', 'blossom' and 'flower'.

In addition, the expected data ("all") for the Milvus range (Europe) boundary from iNaturalist is processed.

Select the parameters below:

In [32]:
# topic = "flickr_milvusmilvus"
# topic_ref = "flickr_milvusmilvus_reduced"
# topic = "flickr_bloom"
# topic_ref = "flickr_bloom"
In [16]:
topic = "milvus_range_inat_all"
topic_ref = "milvus_range_inat"

Process topic_ref:

In [18]:
result = tools.check_table_exists(db_conn_hll, f'{topic_ref}')
print(result)
False
In [19]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {topic_ref})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA
In [20]:
db_conn.query(f"SELECT * FROM mviews.{topic_ref} LIMIT 10;")
Out[20]:
post_guid user_guid month year
0 u7ER5K+y/i3cX/0ZOSI3jWpC5gfDRlEdTFrh4WqFQu0 bR8JvlDf/OTF8Y7zA+QK5iKd36ebkLvXffBtE6qpvUQ 2.0 2020.0
1 4kJjTAh/J9reyj5toO2CxS2mSRP7fbkHUHzKx3PswmY P2ZyUiqfk9dNH8ieP1FMKEbQZuePkEKm4bF60BIO0lA 4.0 2016.0
2 HhKZ65oIB3jZ6p5IN0GHLkLmS84w+P/Rs+YrW997vJI UHBdXDQ3sXwNRYSYIYvdmXLGjpeg9Kn+HOIcxOYv85A 2.0 2020.0
3 ANChnIhgq7r+38tdcE7nAZ6MkzPjkqFFOFewy7CCnkA P2ZyUiqfk9dNH8ieP1FMKEbQZuePkEKm4bF60BIO0lA 4.0 2016.0
4 5ZUv8ZmrFmKW6qwxbWTcyMhVEI4BNWiqVB6MvQaq2PM UHBdXDQ3sXwNRYSYIYvdmXLGjpeg9Kn+HOIcxOYv85A 2.0 2020.0
5 +KUSUucJBtEQuH6XzPmni8GnbCIGX8hEcjjVSp9Zrxk UHBdXDQ3sXwNRYSYIYvdmXLGjpeg9Kn+HOIcxOYv85A 2.0 2020.0
6 9ouYH1a4ZQVFhEoODk4GgzhfcCv7R8T0MX7WoF2byvI UHBdXDQ3sXwNRYSYIYvdmXLGjpeg9Kn+HOIcxOYv85A 2.0 2020.0
7 zVjH3Hg4okgFrjQDCuBv35A1OoemcZqNKHLbVUbY4Vw P2ZyUiqfk9dNH8ieP1FMKEbQZuePkEKm4bF60BIO0lA 4.0 2016.0
8 ePgRPd28YtlHj4szuTAUAs31lg4G/b/oEJx6+Wr+8gw UHBdXDQ3sXwNRYSYIYvdmXLGjpeg9Kn+HOIcxOYv85A 2.0 2020.0
9 pDhNLbmhhYEcbDTp74fAT9ttLSab+h/Det4P0AbSNu8 P2ZyUiqfk9dNH8ieP1FMKEbQZuePkEKm4bF60BIO0lA 4.0 2016.0
In [23]:
%%time
sql_query = materialized_view_hll(
    table_name_src=topic_ref, table_name_dest=f"{topic}_months")
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 701
CPU times: user 1.34 ms, sys: 556 µs, total: 1.9 ms
Wall time: 19.5 s
In [24]:
db_connection_hll.commit()
In [25]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll
    FROM mviews.{topic}_months;
    """
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].astype(int)
In [26]:
usecols = ["year", "month", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / f"{topic}_months.csv",
    mode='w', columns=usecols,
    index=False, header=True)

iNaturalist

Check if foreign table has been imported already:

In [27]:
table_ref = 'inaturalist_all_reduced'
In [28]:
result = tools.check_table_exists(db_conn_hll, table_ref)
result
Out[28]:
False

Import foreign table definition on the hlldb.

In [36]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {table_ref})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA
In [37]:
db_conn.query(f"SELECT * FROM mviews.{table_ref} LIMIT 10;")
Out[37]:
post_guid user_guid month year
0 IYYGdyNruUAieZvpRYxOrrxt2WxQNm/JTW63bLAnp/o 8Un6TW1b8CKLP6v3BXe0zuQq7FYLfVQRVGGPkmPzASw 8.0 2022.0
1 TAOy6fHFZ3g8JTdf+dvYv9ATYRtzImP7Ai1fmeTw7Fo LRe0XgrGIMwxm/nkzsYxuLaJ+xDLfcXoTPp9bXvsMko 9.0 2021.0
2 sK+cyBerK1dLue59Z7hh6ZordfeuOSQHyth9k+uYJx4 Dk8doY+bwJ4z8QIeAKVCnVNDLIK2mgSCHMQdaSF8bqI 3.0 2021.0
3 cjp8sxe/lIEkYKNZvFnumuL8OJ1uj6Bn02pTT5QW26c Dk8doY+bwJ4z8QIeAKVCnVNDLIK2mgSCHMQdaSF8bqI 10.0 2022.0
4 kYG6IxHwleKO1NOZF1bajwXZt75RNaW+eIAzX9zAZRM rkhXtNceVLL9XLqjCwVHBeXxgAdPcxksmAS3zjrlLpE 7.0 2021.0
5 piNLz2R1ukJx8ZIIVQdwUSxSq6i6bA+R152QkTnjH6U W6CMMXwBvduoQ1JkGfljBb64C7XFXnmfWxWQBod/2C0 7.0 2021.0
6 dsUWw/pBTK1KEXUhQeJ2uSdcjci/JyRQ1NpB6arlyQQ 0QVKi5wlM5yWcJf4GAUKA+R49PXQyRMTFXW/xe8NcpI 7.0 2017.0
7 7zuwuFIuqyZdrfeedS4NCDR8KNZD+mcfKJan4D3czNw Nk4OCDJKkr1QVyFTVvl64DKOQ8jAa29GnW1HLp/MJjY 7.0 2022.0
8 g312t8bUfKxwXeZ4orsnLUNwABqMW2hP7SgYX075fhw I6mKObEflrm1f756ofeh9TW4GKd4HTGqlllyNpYCevI 5.0 2021.0
9 IWzp2XaSLl+hFblJOKBrkUz1+LEznKDeb0VLraupc1Q W6CMMXwBvduoQ1JkGfljBb64C7XFXnmfWxWQBod/2C0 7.0 2021.0
In [38]:
db_connection_hll.commit()
In [40]:
table_ref_dest = "inaturalist_all_months"
In [39]:
%%time
sql_query = materialized_view_hll(
    table_name_src=table_ref, table_name_dest=table_ref_dest)
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 1112
In [57]:
db_connection_hll.commit()
In [55]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll
    FROM mviews.{table_ref_dest};
    """
df = db_conn.query(sql_query)
# use type int instead of float
time_cols = ["year", "month"]
# drop where time cols are invalid 
df.dropna(subset=time_cols, inplace=True)
# turn float to int
for col in time_cols:
    df[col] = df[col].astype(int)
# we can also remove any rows where the year is < 2007
df.drop(df[df['year'] < 2007].index, inplace=True)
In [56]:
df
Out[56]:
year month post_hll user_hll
914 2007 1 \x148b7f0002208c400984319461188461846508861100... \x148b7f00c22000010040200000000220046008020004...
915 2007 2 \x148b7f0086229481104a311082104801048431c42204... \x148b7f08802000610000000000000020002000020004...
916 2007 3 \x148b7f110a500c4320c83284822048340882110e4090... \x148b7f00800001000082000801100400004208020100...
917 2007 4 \x148b7f188a43104210c6310c6320c8511ca330c24294... \x148b7f08cc0100630802000824084420800200020104...
918 2007 5 \x148b7f20c8319041288451096230c621146318c64224... \x148b7f08c20100600002008804084220804200c41084...
... ... ... ... ...
1106 2023 1 \x148b7f5a56c4a1675290d529276210d535285a149531... \x148b7f318c521d0638ca8391452a4c62106639065490...
1107 2023 2 \x148b7f3a1a959d68629084a1a9425a94a1ad4a14b529... \x148b7f418e42150640ce821d462a8c4218e521066228...
1108 2023 3 \x148b7f5296a42d8942d6b7250a4a9694a92b4a90c425... \x148b7f318c621504298e839d462a4c42b4e719069210...
1109 2023 4 \x148b7f5316b52d2e42dab5a94c52d2c4292a4b56a5a9... \x148b7f318c62990629508210c629c88234c7514c53a8...
1110 2023 5 \x148b7f21cc93150c4a10831d2552489320e639cc7314... \x148b7f208c218c83218e8108e628c242346311083314...

197 rows × 4 columns

In [58]:
usecols = ["year", "month", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / f"{table_ref_dest}.csv",
    mode='w', columns=usecols,
    index=False, header=True)

Focus Region

Difference below: these tables additionally include an origin_id column.

In [45]:
table_ref = 'milvus_focus_flickr_inat'
In [46]:
result = tools.check_table_exists(db_conn_hll, table_ref)
result
Out[46]:
True

Import foreign table definition on the hlldb.

In [47]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {table_ref})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
In [48]:
db_conn.query(f"SELECT * FROM mviews.{table_ref} LIMIT 10;")
Out[48]:
post_guid user_guid origin_id month year
0 8B2FIRS/0mgnVg6z2YQHU5P2CcWEw2x1T8c0guBZ9iU uTQkOfUXKrNKHjZd+YK9mQNqnjJBm+IURshJuCMF9tE 23 3.0 2019.0
1 xA//kk8DouX/axdn/lO4PikGbrUaWLiLqpT3pkNTAb8 uTQkOfUXKrNKHjZd+YK9mQNqnjJBm+IURshJuCMF9tE 23 1.0 2020.0
2 t7fKeieGyS/hbEmcXcZBDQrHJmpIYLSGKZ5OGEcZLCM smkc+6zAScI7sAZo7tTTB3EUXsPEoEgjUCQ7HPf7XEc 23 4.0 2021.0
3 RgsfNBIpaCK4YEbeHjloPx0v2/cQihw9emDf3UNPHww 57hhQShuF50ETz7XemrywmlbyrPYy3pvdkXkssk093o 23 6.0 2019.0
4 i0l1t9CYrjt0LJdy1okw8DSr1RbNC01Shmpe1H0xgSQ eMOgQd2Anucsy0N1mc41Vz5bVmzns3BttPjlhvBK4Kg 23 11.0 2019.0
5 w+qWV+xFn+aPxZ+Znu58wAxxXybu4PQ4hV8pJXQt7u8 uTQkOfUXKrNKHjZd+YK9mQNqnjJBm+IURshJuCMF9tE 23 2.0 2020.0
6 Ajmaumh6wRpAJN2XruylxGoPQGE4GYmMmnJRc5L9//4 BVvZ6S68yRuhoWXEsGMArjtswRQ07e4FGPoWmgjQnkg 23 3.0 2020.0
7 N8xKZhtyhU5fnOfViSvI0dl9NckMsnZ/1vs7Cz/WUvY U7lDLVBV0cO4MU9HpY02QA0+zKnUfwjW3Qpu4xG3bTs 23 5.0 2021.0
8 HCp+mRxVw2tLPg3PjsPDcd4T3t/ZVr84WFdP5ejL6n8 uTQkOfUXKrNKHjZd+YK9mQNqnjJBm+IURshJuCMF9tE 23 6.0 2021.0
9 Hd/FQydDFpxHUlgW8Nr3JP+DiFuF+NlB4wUp4GQlo+g eCF1t3m1gZ3IcH1kVo3bQhaNuOY/glbT8cfpE3sKvTs 23 6.0 2022.0
In [35]:
db_connection_hll.commit()
In [49]:
table_ref_dest = "milvus_focus_flickr_inat_all_months"
In [66]:
%%time
sql_query = materialized_view_hll(
    table_name_src=table_ref, table_name_dest=table_ref_dest, additional_cols=['origin_id'])
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 1026
CPU times: user 1.11 ms, sys: 410 µs, total: 1.52 ms
Wall time: 4.3 s
In [67]:
db_connection_hll.commit()
In [68]:
sql_query = f"""
    SELECT  year,
            month,
            origin_id,
            post_hll,
            user_hll
    FROM mviews.{table_ref_dest};
    """
df = db_conn.query(sql_query)
# use type int instead of float
time_cols = ["year", "month"]
# drop where time cols are invalid 
df.dropna(subset=time_cols, inplace=True)
# turn float to int
for col in time_cols:
    df[col] = df[col].astype(int)
# we can also remove any rows where the year is outside the range 2007-2022
df.drop(df[(df['year'] < 2007) | (df['year'] > 2022)].index, inplace=True)
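
The same range filter can also be expressed with Series.between, which is inclusive on both ends by default. A toy example:

```python
import pandas as pd

# Toy frame; between(2007, 2022) keeps 2007 and 2022 (inclusive bounds)
df = pd.DataFrame({"year": [2005, 2007, 2022, 2023]})
kept = df[df["year"].between(2007, 2022)]
```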
In [69]:
df
Out[69]:
year month origin_id post_hll user_hll
630 2007 1 2 \x148b7f00c24000410800218441100000048208422020... \x138b7f00610223028202c2046504a2058206c2094109...
631 2007 2 2 \x148b7f190a608c801080308c8210c81084401908300c... \x138b7f00040465062106e209c10a430aa10b040d220e...
632 2007 3 2 \x148b7f20c21184031942208481210461086120021094... \x138b7f00040043008101c1028203c107c109c10aa30b...
633 2007 3 23 \x128b7fbf7dc7d226e191ba4f4af095ed1ac23b \x128b7fb6f0f487e76c9f29
634 2007 4 2 \x148b7f18c422896300863204240a0e01902200023100... \x138b7f000401e10282032203c103e20404080808c409...
... ... ... ... ... ...
991 2022 8 23 \x148b7f11443100230088018044110a20888318462100... \x138b7f00040061022302e203e2044104660482050205...
992 2022 9 23 \x148b7f18061108000006308840088a60000108442188... \x138b7f0021022102a102e2036103e20441046604c105...
993 2022 10 23 \x148b7f204416042010440008a008080280a200c08084... \x138b7f022302e2036103a10466048204c1060106a306...
994 2022 11 23 \x148b7f0004018c442000010403000000000200403200... \x138b7f00040021022202e203610466050205a2068106...
995 2022 12 23 \x138b7f000100840181024202610381048104e2052305... \x128b7f8126755bbf8e803383ae678593099b83862708...

366 rows × 5 columns

In [70]:
usecols = ["year", "month", "origin_id", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / f"{table_ref_dest}.csv",
    mode='w', columns=usecols,
    index=False, header=True)

Create notebook HTML

In [59]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./00_raw_hll_conversion.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-