Wikidata: Events SPARQL Query

Alexander Dunkel, Institute of Cartography, TU Dresden

Out[1]:

Last updated: Jul-26-2023, Carto-Lab Docker Version 0.14.0

Visualization of events (for Nevada example) queried from Wikidata using SPARQL.

Originally, we intended to discuss the idea of "Event Inventories" in the conference paper. However, this part was cut because of limited space and the additional work it would have required. The following is part of the removed draft text; it describes the results shown below:

Finally, we explored the idea of "event inventories" based on the explicit inclusion of structured VGI. We queried all Wikidata entities of the type "occurrence" with an explicit spatial reference within two areas covering Nevada, USA, and the state of Saxony, Germany. A large number of classical event types were found, such as festivals, sports events, and city fairs, as well as sites of historical sieges. However, the results also include many events that would be difficult to consider as part of a temporal landscape scenic resource inventory, such as accidents, wildfire sites, or homicides and school shootings. We have not included these results here because more work is needed to integrate these data.

Arthur et al. (1977) first used the category of "descriptive inventory" to describe methods in which features that are thought to contribute to the visual character of a landscape are first systematically recorded and then aggregated and related to each other to estimate an overall value. These methods are still used in practice for landscape character assessment, mainly because of their ease of use. As a temporal counterpart, we propose "event inventories" as a solution to filter for known temporal features of landscapes at different levels of specificity, such as the waterfalls in Yosemite that are most impressive in spring, or the regular pattern of California poppies or Nevada deserts in bloom. As a starting point to mitigate the challenge of varying levels of specificity in temporal landscape scenic resources, curated event inventories, such as those derived from Wikipedia, can be used. Such positive filter lists can then be explored and monitored using customized workflows and integrated social media and VGI data. In the fields of landscape and urban planning, event inventories can help to better understand the unique transient characteristics of places, areas and landscapes, to protect and develop specific ephemeral scenic values, or to propose actions to change negative influences.

Preparations

Create environment

In [4]:
!python -m venv /envs/wikidata_venv

Install qwikidata in a venv and link the Python Kernel to Jupyter Lab.

In [9]:
%%bash
# install packages only if qwikidata is not yet present in the venv
if [ ! -d "/envs/wikidata_venv/lib/python3.10/site-packages/qwikidata" ]; then
    /envs/wikidata_venv/bin/python -m pip install qwikidata ipykernel pandas > /dev/null 2>&1
else
    echo "Already installed."
fi
# link the venv as a Jupyter kernel
if [ ! -d "/root/.local/share/jupyter/kernels/qwikidata" ]; then
    echo "Linking environment to jupyter"
    /envs/wikidata_venv/bin/python -m ipykernel install --user --name=qwikidata
else
    echo "Already linked."
fi
Already installed.
Linking environment to jupyter
Installed kernelspec qwikidata in /root/.local/share/jupyter/kernels/qwikidata

Press F5 to reload the page, then select the qwikidata kernel in the top-right corner of Jupyter Lab.

See the package versions used below.

List of package versions used in this notebook
package | python  | ipykernel | pandas | qwikidata
version | 3.10.12 | 6.24.0    | 2.0.3  | 0.4.2

Query wikidata using SPARQL

Import dependencies

In [4]:
import csv
import pandas as pd
from qwikidata.sparql import return_sparql_query_results

Define query:

  • use distance query to Nevada (centroid)
  • filtering by state geometry is done later in GeoPandas
  • see SPARQL examples here and here
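
As a rough illustration of what the distance filter does, the following sketch computes a great-circle distance in kilometers with the haversine formula and compares it to a threshold. This is an approximation for intuition only: the Earth radius, the function names, and the reference coordinate are assumptions, and the query service may use a different distance model internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point, reference, max_km):
    """True if `point` (lon, lat) lies closer than `max_km` to `reference`."""
    return haversine_km(*point, *reference) < max_km


# Hypothetical reference point roughly in central Nevada (not the value
# stored on Wikidata), and one event coordinate in Las Vegas.
reference = (-117.0, 39.0)
las_vegas_event = (-115.125, 36.128)
print(within_distance(las_vegas_event, reference, max_km=400))
```

Because a circle around a single point does not follow the state outline, the distance filter also returns events in neighboring states, which is why the geometry filter is applied afterwards.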

Parameters

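Two parameters need to be adjusted: entity, the Wikidata ID of the item whose coordinate location (P625) serves as the reference point, and geodistance, the search radius around that point in kilometers. loc_name is used for the query title, the output file names, and the shapefile selection further below.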

In [1]:
## Example 1:
loc_name = "Nevada"
entity = "Q1227"
geodistance = 400

## Example 2:
# loc_name = "Leipzig"
# geodistance = 80
# entity = "Q2079" # Leipzig, Germany

In [6]:
sparql_query = f"""
#title: All events in {loc_name}, based on distance query ({geodistance})
SELECT ?event ?eventLabel ?date ?location ?eventDescription
WITH {{
  SELECT DISTINCT ?event ?date ?location
  WHERE {{
    # find events
    wd:{entity} wdt:P625 ?loc_ref. 
    ?event wdt:P31/wdt:P279* wd:Q1190554.
           # wdt:P17 wd:Q30;
    # with a point in time or start date
    OPTIONAL {{ ?event wdt:P585 ?date. }}
    OPTIONAL {{ ?event wdt:P580 ?date. }}
    ?event wdt:P625 ?location.
    FILTER(geof:distance(?location, ?loc_ref) < {geodistance}).
  }}
  LIMIT 5000
}} AS %i
WHERE {{
  INCLUDE %i
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de" .}}
}}
"""
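
Because the query is built as a Python f-string, literal SPARQL braces must be doubled, while single braces mark the Python variables to substitute. A minimal sketch with a shortened query (not the full query above) shows the effect:

```python
entity = "Q1227"      # same values as in the Parameters cell
geodistance = 400

# {{ and }} become literal { and } in the result;
# {entity} and {geodistance} are replaced by the Python values.
short_query = f"""
SELECT ?event WHERE {{
  wd:{entity} wdt:P625 ?loc_ref.
  ?event wdt:P625 ?location.
  FILTER(geof:distance(?location, ?loc_ref) < {geodistance}).
}}
"""
print(short_query)
```

Printing the string before sending it is a quick way to confirm that the substitution produced a valid query.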
In [9]:
%%time
result = return_sparql_query_results(sparql_query)
CPU times: user 26.6 ms, sys: 0 ns, total: 26.6 ms
Wall time: 39.8 s

Format and convert to pandas DataFrame

In [10]:
import dateutil.parser

event_list = []
for event in result["results"]["bindings"]:
    # ?date is OPTIONAL in the query, so the key may be missing
    date_val = event.get('date')
    if date_val:
        # parse the ISO 8601 string; timestamps that pandas
        # cannot represent are coerced to NaT
        date_val = pd.to_datetime(
            dateutil.parser.parse(date_val['value']), errors='coerce')
    # the description is missing if none exists in the label languages
    event_desc = event.get('eventDescription')
    if event_desc:
        event_desc = event_desc['value']
    # keep the same order as result['head']['vars']
    event_tuple = (
        event['event']['value'],
        event['eventLabel']['value'],
        date_val,
        event['location']['value'],
        event_desc)
    event_list.append(event_tuple)
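
The loop above relies on the shape of the SPARQL JSON results: each binding is a dictionary keyed by variable name, and unbound OPTIONAL variables are simply absent. The following sketch walks a small mock response using only the standard library. The mock is abridged and hypothetical in its exact fields; the values are taken from the output tables in this notebook, and `value_or_none` is a helper introduced here for illustration.

```python
from datetime import datetime

# Abridged mock of a SPARQL JSON response (assumed shape, real responses
# carry additional keys such as datatype or xml:lang).
mock_result = {
    "head": {"vars": ["event", "eventLabel", "date", "location", "eventDescription"]},
    "results": {"bindings": [
        {
            "event": {"type": "uri", "value": "http://www.wikidata.org/entity/Q4602566"},
            "eventLabel": {"type": "literal", "value": "2004 Bridgestone 400"},
            "date": {"type": "literal", "value": "2004-09-25T00:00:00Z"},
            "location": {"type": "literal", "value": "Point(-115.01112 36.27134)"},
            "eventDescription": {"type": "literal", "value": "motor car race"},
        },
        {
            # no date: the "date" key is absent
            "event": {"type": "uri", "value": "http://www.wikidata.org/entity/Q29098186"},
            "eventLabel": {"type": "literal", "value": "Hilton Grand Vacations Club"},
            "location": {"type": "literal", "value": "Point(-115.161261 36.140165)"},
            "eventDescription": {"type": "literal", "value": "hotel in Las Vegas, Nevada"},
        },
        {
            # no description: the "eventDescription" key is absent
            "event": {"type": "uri", "value": "http://www.wikidata.org/entity/Q25316469"},
            "eventLabel": {"type": "literal", "value": "1954 NCAA Skiing Championships"},
            "date": {"type": "literal", "value": "1954-01-01T00:00:00Z"},
            "location": {"type": "literal", "value": "Point(-119.872 39.318)"},
        },
    ]},
}


def value_or_none(binding, key):
    """Return the plain value for `key`, or None if the variable is unbound."""
    cell = binding.get(key)
    return cell["value"] if cell else None


rows = []
for binding in mock_result["results"]["bindings"]:
    raw_date = value_or_none(binding, "date")
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    date = datetime.fromisoformat(raw_date.replace("Z", "+00:00")) if raw_date else None
    rows.append((
        value_or_none(binding, "event"),
        value_or_none(binding, "eventLabel"),
        date,
        value_or_none(binding, "location"),
        value_or_none(binding, "eventDescription"),
    ))

for row in rows:
    print(row)
```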
In [11]:
df = pd.DataFrame(event_list, columns=result['head']['vars'])
In [12]:
df.head()
Out[12]:
  | event | eventLabel | date | location | eventDescription
0 | http://www.wikidata.org/entity/Q116448291 | California Revealed | NaT | Point(-121.49633 38.575783) | online project of archival resources
1 | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.161261 36.140165) | hotel in Las Vegas, Nevada
2 | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.160386 36.140174) | hotel in Las Vegas, Nevada
3 | http://www.wikidata.org/entity/Q4602566 | 2004 Bridgestone 400 | 2004-09-25 00:00:00+00:00 | Point(-115.01112 36.27134) | motor car race
4 | http://www.wikidata.org/entity/Q16274840 | 1964 LPGA Championship | 1964-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament
In [13]:
print(len(df))
328

Store to disk

In [14]:
from pathlib import Path
OUTPUT = Path.cwd().parents[0] / "out"
OUTPUT.mkdir(exist_ok=True)  # make sure the output folder exists
df.to_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl")

Visualize on a map

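Select worker_env as the kernel for the visualization steps. Variables do not carry over between kernels, so re-run the Parameters cell (loc_name, entity, geodistance) after switching.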

In [2]:
%load_ext autoreload
%autoreload 2

Load dependencies

In [3]:
import sys
import pandas as pd
import geopandas as gp
from pathlib import Path
from shapely.geometry import Point
from shapely import wkt
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
List of package versions used in this notebook
package | python | Shapely | geopandas | pandas
version | 3.9.15 | 1.7.1   | 0.13.2    | 2.0.3
In [5]:
OUTPUT = Path.cwd().parents[0] / "out" 
df = pd.read_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl") 
In [6]:
CRS_WGS = "epsg:4326"

df['geometry'] = df.location.apply(wkt.loads)
gdf = gp.GeoDataFrame(df, crs=CRS_WGS)
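
The location strings returned by the query are WKT points with longitude first and latitude second, which is what wkt.loads parses above. As a minimal sketch of that format, the following standard-library parser extracts both numbers. It is a simplified illustration (plain decimal numbers only, hypothetical function name), not a replacement for shapely's WKT reader.

```python
import re

# Matches "Point(<lon> <lat>)"; case-insensitive, optional space before "(".
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE)


def parse_wkt_point(text):
    """Return (lon, lat) as floats from a WKT point string, or None if it does not match."""
    match = POINT_RE.match(text)
    if match is None:
        return None
    lon, lat = (float(group) for group in match.groups())
    return lon, lat


print(parse_wkt_point("Point(-115.125 36.128)"))
```

Keeping the axis order in mind matters when passing coordinates on to other tools, since some expect latitude first.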

Get shapefiles for US states / German states

In [7]:
if loc_name == "Nevada":
    source_zip = "https://www2.census.gov/geo/tiger/GENZ2018/shp/"
    filename = "cb_2018_us_state_5m.zip"
    shapes_name = "cb_2018_us_state_5m.shp"
elif loc_name == "Leipzig":
    source_zip = "https://daten.gdz.bkg.bund.de/produkte/vg/vg2500/aktuell/"
    filename = "vg2500_12-31.utm32s.shape.zip"
    shapes_name = "vg2500_12-31.utm32s.shape/vg2500/VG2500_LAN.shp"
In [8]:
SHAPE_DIR = (OUTPUT / "shapes")
SHAPE_DIR.mkdir(exist_ok=True)

if not (SHAPE_DIR / shapes_name).exists():
    tools.get_zip_extract(uri=source_zip, filename=filename, output_path=SHAPE_DIR)
else:
    print("Already exists")
Already exists
In [9]:
shapes = gp.read_file(SHAPE_DIR / shapes_name)
shapes = shapes.to_crs("EPSG:4326")
In [10]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax.set_axis_off()
buffer = 0.5
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[10]:
(35.117, 43.0)

Highlight/Select all in Region

We want to keep only those events whose location falls within the state boundary (Nevada or Saxony).

In [11]:
if loc_name == "Nevada":
    state_name = "Nevada"
    col_name = "NAME"
elif loc_name == "Leipzig":
    state_name = "Sachsen"
    col_name = "GEN"
In [12]:
sel_geom = shapes[shapes[col_name]==state_name].copy()
In [13]:
tools.drop_cols_except(df=sel_geom, columns_keep=["geometry", col_name])
sel_geom.rename(columns={col_name: "country"}, inplace=True)
In [14]:
gdf_overlay = gp.overlay(
    gdf, sel_geom,
    how='intersection')
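
Conceptually, the intersection overlay keeps a point if it lies inside the selected polygon. The following sketch illustrates that predicate with a plain ray-casting test. It is not how GeoPandas implements the overlay, and the outline used here is a hypothetical, very coarse polygon rather than the real Nevada boundary.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical coarse outline (assumption, for illustration only).
coarse_outline = [
    (-120.0, 42.0), (-114.0, 42.0), (-114.0, 36.0), (-114.6, 35.0), (-120.0, 39.0)]

points = {
    "2004 Bridgestone 400": (-115.01112, 36.27134),
    "California Revealed": (-121.49633, 38.575783),
}
results = {
    name: point_in_polygon(lon, lat, coarse_outline)
    for name, (lon, lat) in points.items()}
print(results)
```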
In [15]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax = gdf_overlay.plot(ax=ax, color='red')
ax.set_axis_off()
buffer = 1
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[15]:
(34.617, 43.5)
In [16]:
print(f'{len(gdf_overlay)} events queried from wikidata that are located in {loc_name}')
117 events queried from wikidata that are located in Nevada
In [17]:
gdf_overlay.head(20)
Out[17]:
   | event | eventLabel | date | location | eventDescription | country | geometry
0  | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.161261 36.140165) | hotel in Las Vegas, Nevada | Nevada | POINT (-115.16126 36.14017)
1  | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.160386 36.140174) | hotel in Las Vegas, Nevada | Nevada | POINT (-115.16039 36.14017)
2  | http://www.wikidata.org/entity/Q4602566 | 2004 Bridgestone 400 | 2004-09-25 00:00:00+00:00 | Point(-115.01112 36.27134) | motor car race | Nevada | POINT (-115.01112 36.27134)
3  | http://www.wikidata.org/entity/Q16274840 | 1964 LPGA Championship | 1964-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
4  | http://www.wikidata.org/entity/Q4571929 | 1965 LPGA Championship | 1965-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
5  | http://www.wikidata.org/entity/Q4570360 | 1961 LPGA Championship | 1961-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
6  | http://www.wikidata.org/entity/Q4572336 | 1966 LPGA Championship | 1966-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
7  | http://www.wikidata.org/entity/Q4570751 | 1962 LPGA Championship | 1962-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
8  | http://www.wikidata.org/entity/Q4571127 | 1963 LPGA Championship | 1963-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
9  | http://www.wikidata.org/entity/Q111021622 | 3-Cushion World Cup 2022-2 | 2022-01-01 00:00:00+00:00 | Point(-115.18708 36.116869) | Internationales Karambolageturnier | Nevada | POINT (-115.18708 36.11687)
10 | http://www.wikidata.org/entity/Q24906942 | Real World: Go Big or Go Home | NaT | Point(-115.140444444 36.170972222) | thirty-first season of Real World | Nevada | POINT (-115.14044 36.17097)
11 | http://www.wikidata.org/entity/Q7759664 | The Real World: Las Vegas, 2002 season | 2002-09-17 00:00:00+00:00 | Point(-115.194 36.1139) | twelth season of The Real World | Nevada | POINT (-115.19400 36.11390)
12 | http://www.wikidata.org/entity/Q7759665 | The Real World: Las Vegas, 2011 season | 2011-03-09 00:00:00+00:00 | Point(-115.154 36.11) | twenty-fifth season of The Real World | Nevada | POINT (-115.15400 36.11000)
13 | http://www.wikidata.org/entity/Q104786210 | 2021 NHL Outdoor Games | NaT | Point(-119.949 38.968) | outdoor National Hockey League game | Nevada | POINT (-119.94900 38.96800)
14 | http://www.wikidata.org/entity/Q25316469 | 1954 NCAA Skiing Championships | 1954-01-01 00:00:00+00:00 | Point(-119.872 39.318) | None | Nevada | POINT (-119.87200 39.31800)
15 | http://www.wikidata.org/entity/Q15092916 | Sparks Middle School shooting | NaT | Point(-119.76838889 39.55191667) | Shooting in Sparks, Nevada, on October 21, 2013 | Nevada | POINT (-119.76839 39.55192)
16 | http://www.wikidata.org/entity/Q15806674 | Dreiband-Weltmeisterschaft 1978 | 1978-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 33. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
17 | http://www.wikidata.org/entity/Q15806682 | 1986 UMB World Three-cushion Championship | 1986-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 41. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
18 | http://www.wikidata.org/entity/Q15806666 | Dreiband-Weltmeisterschaft 1970 | 1970-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 25. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
19 | http://www.wikidata.org/entity/Q6492580 | Las Vegas Grind | NaT | Point(-115.193 36.1166) | ls Vegas Grind Festival | Nevada | POINT (-115.19300 36.11660)

Store results as CSV

In [18]:
gdf_overlay.to_csv(OUTPUT / f"wikidata_events_{loc_name.lower()}.csv")

Create notebook HTML

In [20]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./03_wikidata_event_query.ipynb \
    --output 03_wikidata_event_query_{loc_name.lower()} \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-