Wikidata: Events SPARQL Query

Alexander Dunkel, Institute of Cartography, TU Dresden

Out[1]:

Last updated: Jul-26-2023, Carto-Lab Docker Version 0.14.0

Visualization of events queried from Wikidata using SPARQL, shown for two example regions (Nevada and Leipzig).

Preparations

Create environment

In [4]:
!python -m venv /envs/wikidata_venv

Install qwikidata in a venv and link the Python Kernel to Jupyter Lab.

In [9]:
%%bash
if [ ! -d "/envs/wikidata_venv/lib/python3.10/site-packages/qwikidata" ]; then
    /envs/wikidata_venv/bin/python -m pip install qwikidata ipykernel pandas > /dev/null 2>&1
else
  echo "Already installed."
fi
# link
if [ ! -d "/root/.local/share/jupyter/kernels/qwikidata" ]; then
    echo "Linking environment to jupyter"
    /envs/wikidata_venv/bin/python -m ipykernel install --user --name=qwikidata
else
  echo "Already linked."
fi
Already installed.
Linking environment to jupyter
Installed kernelspec qwikidata in /root/.local/share/jupyter/kernels/qwikidata

Hit F5 and select the qwikidata kernel in the top-right corner of Jupyter Lab.

See the package versions used below.

List of package versions used in this notebook
package  python   ipykernel  pandas  qwikidata
version  3.10.12  6.24.0     2.0.3   0.4.2

Query wikidata using SPARQL

Import dependencies

In [1]:
import csv
import pandas as pd
from qwikidata.sparql import return_sparql_query_results

Define query:

  • use a distance query around the coordinate of the selected entity (e.g. the Nevada centroid)
  • filtering by state boundary is done later in GeoPandas
  • see SPARQL examples here and here
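
To make the distance filter concrete, the following is a minimal sketch of a great-circle (haversine) distance in kilometres, the unit in which geof:distance reports its result. It assumes a spherical Earth with a mean radius of 6371 km. The names haversine_km, within_distance and EARTH_RADIUS_KM are illustrative and not part of the Wikidata Query Service, whose internal formula may differ slightly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point, reference, max_km: float) -> bool:
    """Mimic the query FILTER: keep a (lon, lat) point closer than max_km."""
    return haversine_km(*point, *reference) < max_km
```

Because the filter is a circle around a single coordinate, it also returns events outside the region of interest, which is why the geometry-based selection is still needed afterwards.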

Parameters

Two parameters need to be adapted: the Wikidata entity, whose coordinate (P625) serves as the reference location, and the geodistance, the search radius in kilometres around that location. The loc_name is only used for the query title and for output file names.

In [5]:
## Example 1:
# loc_name = "Nevada"
# entity = "Q1227"
# geodistance = 400

## Example 2:
loc_name = "Leipzig"
geodistance = 80
entity = "Q2079" # Leipzig, Germany
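
Since both values are pasted into the query text with an f-string, it may be worth checking them first. The following is a minimal sketch of such a guard; validate_parameters is a hypothetical helper and not part of this notebook or of qwikidata.

```python
import re


def validate_parameters(entity: str, geodistance: float) -> None:
    """Raise ValueError if the values are not safe to paste into the query."""
    # Wikidata item IDs are the letter Q followed by digits, e.g. "Q2079"
    if not isinstance(entity, str) or not re.fullmatch(r"Q[1-9]\d*", entity):
        raise ValueError(f"Not a Wikidata item ID: {entity!r}")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(geodistance, bool) or not isinstance(geodistance, (int, float)):
        raise ValueError("geodistance must be a number")
    if geodistance <= 0:
        raise ValueError("geodistance must be positive")
```

Calling validate_parameters(entity, geodistance) before building the query would then fail early on a typo such as a label in place of the Q-number.
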
In [3]:
sparql_query = f"""
#title: All events in {loc_name}, based on distance query ({geodistance})
SELECT ?event ?eventLabel ?date ?location ?eventDescription
WITH {{
  SELECT DISTINCT ?event ?date ?location
  WHERE {{
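    # reference coordinate of the selected entity
    wd:{entity} wdt:P625 ?loc_ref. 
    # find events: instances of wd:Q1190554 or of any of its subclasses
    ?event wdt:P31/wdt:P279* wd:Q1190554.
           # wdt:P17 wd:Q30;
    # with a point in time or start date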
    OPTIONAL {{ ?event wdt:P585 ?date. }}
    OPTIONAL {{ ?event wdt:P580 ?date. }}
    ?event wdt:P625 ?location.
    FILTER(geof:distance(?location, ?loc_ref) < {geodistance}).
  }}
  LIMIT 5000
}} AS %i
WHERE {{
  INCLUDE %i
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de" .}}
}}
"""
In [4]:
%%time
result = return_sparql_query_results(sparql_query)
CPU times: user 26.7 ms, sys: 3.29 ms, total: 30 ms
Wall time: 47.2 s
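
With a wall time of almost a minute, it can be convenient to keep the raw JSON response on disk while experimenting with the later cells. The following is a minimal sketch under that assumption; cached_query is a hypothetical helper, and the fetch function is passed in so that the sketch does not depend on a particular client.

```python
import json
from pathlib import Path
from typing import Callable


def cached_query(query: str, cache_file: Path,
                 fetch: Callable[[str], dict]) -> dict:
    """Return the cached JSON result if present, else fetch and store it."""
    if cache_file.exists():
        with cache_file.open(encoding="utf-8") as f:
            return json.load(f)
    result = fetch(query)
    # create the cache folder on first use
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

In this notebook the call could look like cached_query(sparql_query, cache_path, return_sparql_query_results), where cache_path is a file name of your choice. Note that the sketch does not detect a changed query: the cache file has to be deleted or renamed when the parameters change.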

Format and convert to pandas DataFrame

In [5]:
event_list = []
for event in result["results"]["bindings"]:
    # date and description are optional and may be missing from a binding
    date_val = event.get('date')
    if date_val:
        # unparseable or out-of-range dates become NaT instead of raising
        date_val = pd.to_datetime(
            date_val.get('value'), errors='coerce', utc=True)
    event_desc = event.get('eventDescription')
    if event_desc:
        event_desc = event_desc.get('value')
    # tuple order must match the variable order of the SELECT clause
    event_tuple = (
        event['event']['value'],
        event['eventLabel']['value'],
        date_val,
        event['location']['value'],
        event_desc)
    event_list.append(event_tuple)
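
In the SPARQL JSON results format, a variable that was not bound (here: the optional date and description) is simply absent from the binding. The following minimal sketch shows the same flattening step as a pair of small functions; binding_value and flatten_binding are hypothetical helpers, and dates are left as plain strings for simplicity.

```python
from typing import Optional


def binding_value(binding: dict, key: str) -> Optional[str]:
    """Return the 'value' of a SPARQL JSON binding, or None if unbound."""
    cell = binding.get(key)
    return cell.get("value") if cell else None


def flatten_binding(binding: dict, variables: list) -> tuple:
    """Turn one result row into a tuple ordered like `variables`."""
    return tuple(binding_value(binding, var) for var in variables)
```

Passing result['head']['vars'] as variables would keep the tuple order and the DataFrame column order in sync automatically.
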
In [6]:
df = pd.DataFrame(event_list, columns=result['head']['vars'])
In [7]:
df.head()
Out[7]:
event eventLabel date location eventDescription
0 http://www.wikidata.org/entity/Q16854674 MENALIB NaT Point(11.97 51.49) Webportal des Fachinformationsdienstes Nahost-...
1 http://www.wikidata.org/entity/Q24730444 adlr.link NaT Point(12.368194444 51.3325) web portal of the Specialised Information Serv...
2 http://www.wikidata.org/entity/Q28245008 Specialised Information Service Middle East, N... NaT Point(11.97 51.49) MENALIB – Web Portal of the Specialised Inform...
3 http://www.wikidata.org/entity/Q65952781 Staatliche Studienakademie Leipzig NaT Point(12.30261 51.31028) None
4 http://www.wikidata.org/entity/Q96623670 Staatliche Studienakademie Riesa NaT Point(13.28919 51.31631) None
In [8]:
print(len(df))
69

Store to disk

In [9]:
from pathlib import Path
OUTPUT = Path.cwd().parents[0] / "out"
OUTPUT.mkdir(exist_ok=True)  # to_pickle does not create missing folders
df.to_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl")

Visualize on a map

Select worker_env as the visualization environment.

In [1]:
%load_ext autoreload
%autoreload 2

Load dependencies

In [2]:
import sys
import pandas as pd
import geopandas as gp
from pathlib import Path
from shapely.geometry import Point
from shapely import wkt
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
List of package versions used in this notebook
package  python  Shapely  geopandas  pandas
version  3.9.15  1.7.1    0.13.2     2.0.3
In [6]:
OUTPUT = Path.cwd().parents[0] / "out" 
df = pd.read_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl") 
In [7]:
CRS_WGS = "epsg:4326"

df['geometry'] = df.location.apply(wkt.loads)
gdf = gp.GeoDataFrame(df, crs=CRS_WGS)
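
The location column holds WKT literals such as Point(12.368194444 51.3325), with the longitude first and the latitude second. The notebook relies on shapely's wkt.loads for parsing; the following dependency-free sketch only illustrates the coordinate order. parse_point is a hypothetical helper that handles plain decimal numbers only (no exponent notation, no Z coordinate).

```python
import re

# matches e.g. "Point(12.368194444 51.3325)", case-insensitive
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE)


def parse_point(wkt_literal: str) -> tuple:
    """Return (longitude, latitude) from a simple WKT point literal."""
    match = _POINT_RE.match(wkt_literal)
    if match is None:
        raise ValueError(f"Not a WKT point: {wkt_literal!r}")
    lon, lat = (float(part) for part in match.groups())
    return lon, lat
```
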

Get shapefile for US states / Germany

In [8]:
if loc_name == "Nevada":
    source_zip = "https://www2.census.gov/geo/tiger/GENZ2018/shp/"
    filename = "cb_2018_us_state_5m.zip"
    shapes_name = "cb_2018_us_state_5m.shp"
elif loc_name == "Leipzig":
    source_zip = "https://daten.gdz.bkg.bund.de/produkte/vg/vg2500/aktuell/"
    filename = "vg2500_12-31.utm32s.shape.zip"
    shapes_name = "vg2500_12-31.utm32s.shape/vg2500/VG2500_LAN.shp"
else:
    # fail early instead of hitting an undefined shapes_name below
    raise ValueError(f"No shapefile source defined for {loc_name}")
In [9]:
SHAPE_DIR = (OUTPUT / "shapes")
SHAPE_DIR.mkdir(exist_ok=True)

if not (SHAPE_DIR / shapes_name).exists():
    tools.get_zip_extract(uri=source_zip, filename=filename, output_path=SHAPE_DIR)
else:
    print("Already exists")
Already exists
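
The helper tools.get_zip_extract comes from this repository's own module and its implementation is not shown here. As a rough idea of what such a download-and-unpack step can look like with the standard library alone, here is a minimal sketch. Both function names are hypothetical, and the sketch is not the actual code of tools.get_zip_extract.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_zip_bytes(data: bytes, output_path: Path) -> list:
    """Extract an in-memory zip archive and return the member names."""
    output_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(output_path)
        return archive.namelist()


def download_and_extract(uri: str, filename: str, output_path: Path) -> list:
    """Download uri + filename and unpack the archive into output_path."""
    with urllib.request.urlopen(uri + filename) as response:
        return extract_zip_bytes(response.read(), output_path)
```

Keeping the extraction separate from the download makes the unpacking step testable without network access.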
In [10]:
shapes = gp.read_file(SHAPE_DIR / shapes_name)
shapes = shapes.to_crs("EPSG:4326")
In [11]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax.set_axis_off()
buffer = 0.5
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[11]:
(50.28028, 52.395277777)

Highlight/Select all in Region

We want to keep only those events whose location falls within the state boundary (Nevada or Saxony).

In [13]:
if loc_name == "Nevada":
    state_name = "Nevada"
    col_name = "NAME"
elif loc_name == "Leipzig":
    state_name = "Sachsen"
    col_name = "GEN"
In [14]:
sel_geom = shapes[shapes[col_name]==state_name].copy()
In [16]:
tools.drop_cols_except(df=sel_geom, columns_keep=["geometry", col_name])
sel_geom.rename(columns={col_name: "country"}, inplace=True)
In [17]:
gdf_overlay = gp.overlay(
    gdf, sel_geom,
    how='intersection')
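
For point data, the intersection overlay boils down to a point-in-polygon test against the state boundary. GeoPandas delegates this to its geometry engine; the following sketch swaps in the classic ray casting algorithm purely to illustrate the principle. point_in_ring is a hypothetical helper that works on a single outer ring and ignores holes, multi-part polygons and points lying exactly on the boundary.

```python
def point_in_ring(lon: float, lat: float, ring: list) -> bool:
    """Ray casting test: is (lon, lat) inside the ring of (x, y) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # count only crossings to the right of the point
            if lon < x_cross:
                inside = not inside
    return inside
```

An odd number of crossings means the point is inside the ring, an even number means it is outside.
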
In [18]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax = gdf_overlay.plot(ax=ax, color='red')
ax.set_axis_off()
buffer = 1
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[18]:
(49.78028, 52.895277777)
In [19]:
print(f'{len(gdf_overlay)} events queried from Wikidata that are located in {state_name}')
41 events queried from Wikidata that are located in Sachsen
In [20]:
gdf_overlay.head(20)
Out[20]:
event eventLabel date location eventDescription country geometry
0 http://www.wikidata.org/entity/Q24730444 adlr.link NaT Point(12.368194444 51.3325) web portal of the Specialised Information Serv... Sachsen POINT (12.36819 51.33250)
1 http://www.wikidata.org/entity/Q65952781 Staatliche Studienakademie Leipzig NaT Point(12.30261 51.31028) None Sachsen POINT (12.30261 51.31028)
2 http://www.wikidata.org/entity/Q96623670 Staatliche Studienakademie Riesa NaT Point(13.28919 51.31631) None Sachsen POINT (13.28919 51.31631)
3 http://www.wikidata.org/entity/Q828773 Berufsakademie Glauchau NaT Point(12.5567 50.8228) educational institution Sachsen POINT (12.55670 50.82280)
4 http://www.wikidata.org/entity/Q19963896 N’Ostalgiemuseum NaT Point(12.37845 51.342030555) museum in Germany Sachsen POINT (12.37845 51.34203)
5 http://www.wikidata.org/entity/Q1082822 2008 German motorcycle Grand Prix 2008-07-13 00:00:00+00:00 Point(12.6887 50.7915) None Sachsen POINT (12.68870 50.79150)
6 http://www.wikidata.org/entity/Q1682931 Leipziger Kleinmesse NaT Point(12.34305556 51.34055556) volksfestartige Veranstaltung in Leipzig Sachsen POINT (12.34306 51.34056)
7 http://www.wikidata.org/entity/Q14544300 Nachtdigital NaT Point(13.09080833 51.40503611) Musikfestival für Techno und House Sachsen POINT (13.09081 51.40504)
8 http://www.wikidata.org/entity/Q15060352 Th!nk? NaT Point(12.33527778 51.26944444) music festival near Leipzig, Germany Sachsen POINT (12.33528 51.26944)
9 http://www.wikidata.org/entity/Q2311733 Splash! NaT Point(12.81416667 50.83694444) hip hop and reggae festival in Germany Sachsen POINT (12.81417 50.83694)
10 http://www.wikidata.org/entity/Q836514 Wave-Gotik-Treffen NaT Point(12.37472222 51.34027778) music festival in Leipzig. Germany Sachsen POINT (12.37472 51.34028)
11 http://www.wikidata.org/entity/Q1340529 Endless Summer Open Air 1996-01-01 00:00:00+00:00 Point(12.967601 51.5465796) music festival Sachsen POINT (12.96760 51.54658)
12 http://www.wikidata.org/entity/Q60524551 Siege of Gana NaT Point(13.216666666 51.25) 929 CE German-Slavic military conflict Sachsen POINT (13.21667 51.25000)
13 http://www.wikidata.org/entity/Q815212 Siege of Torgau 1813-10-18 00:00:00+00:00 Point(13.005555555 51.560277777) 1813 siege during the War of the Sixth Coalition Sachsen POINT (13.00556 51.56028)
14 http://www.wikidata.org/entity/Q1069580 Chemnitz Linux Days NaT Point(12.92972222 50.81305556) event sequence Sachsen POINT (12.92972 50.81306)
15 http://www.wikidata.org/entity/Q107157137 1995 Breitenau rail accident 1995-05-23 00:00:00+00:00 Point(13.1585209 50.8387995) Kollision zweier Reisezüge mit einem Bagger im... Sachsen POINT (13.15852 50.83880)
16 http://www.wikidata.org/entity/Q571730 Leipzig Book Fair NaT Point(12.40277778 51.39666667) recurring event Sachsen POINT (12.40278 51.39667)
17 http://www.wikidata.org/entity/Q15110205 Eisenbahnunfall von Braunsdorf 1913-12-14 00:00:00+00:00 Point(13.0241 50.8884) Eisenbahnunfall nach Bergrutsch im Jahr 1913 b... Sachsen POINT (13.02410 50.88840)
18 http://www.wikidata.org/entity/Q228536 Eisenbahnunfall von Schweinsburg-Culten 1972-10-30 00:00:00+00:00 Point(12.36544 50.78028) train wreck Sachsen POINT (12.36544 50.78028)
19 http://www.wikidata.org/entity/Q1312143 Bornitz train collision 1956-02-25 00:00:00+00:00 Point(13.176 51.3027) train wreck Sachsen POINT (13.17600 51.30270)

Store results as CSV

In [21]:
gdf_overlay.to_csv(OUTPUT / f"wikidata_events_{loc_name.lower()}.csv")

Create notebook HTML

In [4]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./03_wikidata_event_query.ipynb \
    --output 03_wikidata_event_query_{loc_name.lower()} \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-