Part 1: Social Media Data

Workshop: Social Media, Data Analysis, & Cartography, WS 2022/23

Alexander Dunkel, Madalina Gugulica, Institute of Cartography, TU Dresden

There are two ways to start the workshop environment:

1. Centrally from the project folder at ZIH (default):

This is the fast way.

In [3]:
!cd .. && sh
Installed kernelspec workshopv2_env in /root/.local/share/jupyter/kernels/workshopv2_env

2. Locally, by installation in the user's home folder (currently disabled):

!cd .. && sh
Well done!

This is the first notebook in a series of four notebooks:

  1. Introduction to Social Media data, jupyter and python spatial visualizations
  2. Introduction to privacy issues with Social Media data and possible solutions for cartographers
  3. Specific visualization techniques example: TagMaps clustering
  4. Specific data analysis: Topic Classification

Open these notebooks through the file explorer on the left side.

If you haven't worked with Jupyter before, here are some tips:
  • Jupyter Lab lets you interactively write and execute annotated code
  • There are two types of cells: Markdown cells contain only text (annotations), Code cells contain only Python code
  • Cells can be executed with SHIFT+Enter
  • The output will appear below the cell
  • The state of Python is kept between code cells: a value assigned to a variable in one cell remains available afterwards
  • This is accomplished with IPython, an interactive version of Python
  • Important: Cells do not have to be executed in linear order. Any cell can be executed at any time, and its code will use the current "state" of all variables. This also allows you to update variables.
Some links
This Python environment is prepared for spatial data processing and cartography.
The following is a list of the most important packages, with references to documentation: We will explore some functionality of these packages in this workshop.
If you want to run these notebooks at home, try the IfK Jupyter Docker Container, which includes the same packages.


We are creating several output graphics and temporary files.

These will be stored in the subfolder notebooks/out/.

In [2]:
from pathlib import Path

OUTPUT = Path.cwd() / "out"
Syntax: pathlib.Path() / "out" ? Python's pathlib provides convenient, OS-independent access to local filesystems. These paths work regardless of the OS used (e.g. Windows or Linux). Path.cwd() returns the current directory, where the notebook is running. See the docs.
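As a minimal sketch of the "/" operator described above, the following builds a nested path and creates the folder if it is missing. The temporary base folder and the file name example.json are assumptions for illustration; in the notebook, Path.cwd() takes the place of base.

```python
import tempfile
from pathlib import Path

# Hypothetical base folder (stand-in for Path.cwd() in the notebook)
base = Path(tempfile.mkdtemp())

# The "/" operator joins path segments independently of the OS
out_dir = base / "out"
out_dir.mkdir(exist_ok=True)  # create the folder only if it does not exist yet

# Hypothetical file name, used only to show the path components
target = out_dir / "example.json"
print(target.name, target.suffix, target.parent.name)
```

Creating the folder up front avoids errors when files are written to it later.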

To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load helper module from ../py/module/

In [3]:
import sys

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    # make the helper modules importable
    sys.path.append(module_path)
from modules import tools

Activate autoreload of changed python files:

In [4]:
%load_ext autoreload
%autoreload 2

Introduction: VGI and Social Media Data

Broadly speaking, GI and User Generated Content can be classified in the following three categories of data:

  • Authoritative data that follows objective criteria of measurement such as Remote Sensing, Land-Use, Soil data etc.
  • Explicitly volunteered data, such as OpenStreetMap or Wikipedia. This is typically collected by many people, who collaboratively work on a common goal and follow more or less specific contribution guidelines.
  • Subjective information sources
    • Explicit: e.g. Surveys, Opinions etc.
    • Implicit: e.g. Social Media

Social Media data belongs to the third category of subjective information, representing certain views held by groups of people. Unlike surveys, no interaction is needed between those who analyze the data and those who share it online, e.g. as part of their daily communication.

Social Media data is used in marketing, but it is also increasingly important for understanding people's behaviour, subjective values, and human-environment interaction, e.g. in citizen science and landscape & urban planning.

In this notebook, we will explore basic routines for accessing Social Media and VGI through APIs and for visualizing the data in Python.

Social Media APIs

  • Social Media data can be accessed through public APIs.
  • This will typically only include data that is explicitly made public by users.
  • Social Media APIs exist for most networks, e.g. Flickr, Twitter, or Instagram
Privacy? We'll discuss legal, ethical and privacy issues with Social Media data in the second notebook: 02_hll_intro.ipynb

Instagram Example

  • Retrieving data from APIs requires a specific syntax that is different for each service.
  • Commonly, there is an endpoint (a URL) that returns data in a structured format (e.g. JSON)
  • most APIs require you to authenticate, but not all (e.g. Instagram,
But the Instagram API was discontinued!
  • Instagram discontinued their official API in October 2018. However, their Web-API is still available, and can be accessed even without authentication.
  • One rationale is that users not signed in to Instagram can have "a peek" at images, which provides significant attraction to join the network.
  • We'll discuss questions of privacy and ethics in the second notebook.

Load Instagram data for a specific Hashtag.

In [6]:
hashtag = "park"
query_url = f'{hashtag}/?__a=1&__d=dis'
Syntax: f'{}' ? This is called an f-string, a convenient Python convention to concatenate strings and variables.
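A short sketch of what an f-string does, including the number formatting used further below in this notebook. The count is the value reported in this notebook's output for the hashtag park; the sentence template itself is only an illustration.

```python
hashtag = "park"
total_cnt = 35531817  # count taken from this notebook's sample output

# Variables are inserted in curly braces; ":,.0f" adds thousands
# separators and shows zero decimal places
message = f"#{hashtag}: {total_cnt:,.0f} posts"
print(message)  # prints: #park: 35,531,817 posts
```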
In [7]:
from IPython.display import HTML, display
display(HTML(tools.print_link(query_url, hashtag)))
  • If you're not signed in: Chances are high that you're seeing a "Login" page. Since we are working in a workshop, only very few requests to Instagram's non-login API are allowed.
  • Otherwise, you'll see a JSON object with the latest feed content
  • In the following, we will try to retrieve this json object and display it.

First, try to get the json-data without login. This may or may not work:

In [8]:
import requests

json_text = None
response = requests.get(
    url=query_url, headers=tools.HEADER)
In [9]:
if response.status_code != 429 and "/login/" not in response.url:
    json_text = response.text
    print("Loaded live json")
Loaded live json
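The check above could also be wrapped in a small helper, sketched here under the assumption that a status code of 429 or a redirect to a "/login/" URL are the only two signs of blocked access. The function name is_blocked and the example URLs are hypothetical.

```python
def is_blocked(status_code: int, final_url: str) -> bool:
    """Return True if the response looks rate-limited or redirected to a login page."""
    return status_code == 429 or "/login/" in final_url

# Hypothetical example values (not real Instagram responses)
print(is_blocked(200, "https://example.org/explore/tags/park/"))   # False
print(is_blocked(429, "https://example.org/explore/tags/park/"))   # True
print(is_blocked(200, "https://example.org/accounts/login/"))      # True
```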

Optionally, write to temporary file:

In [10]:
if json_text:
    with open(OUTPUT / f"live_{hashtag}.json", 'w') as f:
        f.write(json_text)

If the URL refers to the "login" page (or the status_code is 429), access is blocked. In this case, get the sample json:

In [11]:
if not json_text:
    # check if manual json exists
    # (avoid "json" as a loop variable: it would shadow the json module)
    local_json = [path for path in OUTPUT.glob('*.json')]
    if len(local_json) > 0:
        # read local json
        with open(local_json[0], 'r') as f:
            json_text = f.read()
        print("Loaded local json")
Syntax: [x for x in y] ? This is called a list comprehension, a convenient Python convention to create lists (e.g. from generators).
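A list comprehension can also filter and transform items in one expression. The following sketch uses made-up file names, so it runs without any files on disk.

```python
# Hypothetical file names for illustration
file_names = ["live_park.json", "notes.txt", "park.json"]

# Keep only names that end with ".json"
json_names = [name for name in file_names if name.endswith(".json")]

# Transform each remaining name: strip the file extension
stems = [name.rsplit(".", 1)[0] for name in json_names]

print(json_names)
print(stems)
```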

If neither live nor local json has been loaded, load sample json:

In [12]:
if not json_text:
    sample_url = tools.get_sample_url()
    sample_json_url = f'{sample_url}/download?path=%2F&files=park.json'
    response = requests.get(url=sample_json_url)
    json_text = response.text
    print("Loaded sample json")
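The three cells above form a fallback chain: live data first, then a local file, then the sample. A generic version of this pattern might look like the sketch below. The function first_available and the three loader functions are hypothetical stand-ins, not part of the workshop's tools module.

```python
from typing import Callable, Iterable, Optional

def first_available(
        loaders: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Call each loader in order and return the first non-empty result."""
    for load in loaders:
        text = load()
        if text:
            return text
    return None

# Hypothetical loaders that simulate a blocked live request,
# a missing local file, and an available sample
def load_live() -> Optional[str]:
    return None

def load_local() -> Optional[str]:
    return None

def load_sample() -> Optional[str]:
    return '{"source": "sample"}'

json_text_demo = first_available([load_live, load_local, load_sample])
print(json_text_demo)
```

Later loaders are only called if all earlier ones returned nothing, which keeps the number of web requests low.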

Turn text into json format:

In [13]:
import json
json_data = json.loads(json_text)

Have a peek at the returned data.

In [14]:
print(json.dumps(json_data, indent=2)[0:550])
  "graphql": {
    "hashtag": {
      "id": "17841563668080313",
      "name": "park",
      "allow_following": false,
      "is_following": false,
      "is_top_media_only": false,
      "profile_pic_url": "",

The json data is nested. Values can be accessed with dictionary keys.

In [15]:
total_cnt = json_data["graphql"]["hashtag"]["edge_hashtag_to_media"].get("count")

    f'''<details><summary>Working with the JSON Format</summary>
    The json data is nested. Values can be accessed with dictionary keys. <br>For example,
    for the hashtag <strong>{hashtag}</strong>, 
    the total count of available images on Instagram is <strong>{total_cnt:,.0f}</strong>.
Working with the JSON Format The json data is nested. Values can be accessed with dictionary keys.
For example, for the hashtag park, the total count of available images on Instagram is 35,531,817.
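The following sketch shows nested access on a small hand-written JSON string that mirrors the key structure used above. Only the keys and the count that appear in this notebook are used; the real response contains many more fields.

```python
import json

# Reduced, hand-written stand-in for the real response
sample_text = (
    '{"graphql": {"hashtag": {"name": "park", '
    '"edge_hashtag_to_media": {"count": 35531817}}}}'
)
data = json.loads(sample_text)

# Square brackets raise a KeyError if a key is missing
media = data["graphql"]["hashtag"]["edge_hashtag_to_media"]

# .get() returns None (or a given default) instead of raising
total = media.get("count")
missing = media.get("not_a_key")

print(total, missing)
```

Using .get() on the last level is a reasonable choice when a field may be absent from some responses.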

Another more flexible data analytics interface is available with pandas.DataFrame().

Dataframe ? A pandas dataframe is the typical tabular data format used in python data science. Most data can be directly converted to a DataFrame.
In [16]:
import pandas as pd
pd.set_option("display.max_columns", 4)
df = pd.json_normalize(

0 1 2 3 4 5 6 7 8 9 ... 61 62 63 64 65 66 67 68 69 70
node.comments_disabled False False False False False False False False False False ... False False False False False False False False False False
node.__typename GraphSidecar GraphImage GraphSidecar GraphSidecar GraphSidecar GraphImage GraphImage GraphSidecar GraphImage GraphImage ... GraphSidecar GraphImage GraphImage GraphImage GraphImage GraphImage GraphImage GraphImage GraphImage GraphImage
node.id 3011020865957072959 3011020425586398229 3011020350108729570 3011020061113173179 3011020058261409241 3011020033966858383 3011019833512811063 3011019756228398125 3011019428822804704 3011018810678691503 ... 3011007551761896311 3011007459629750660 3011006942528979040 3011006873674678828 3011006281606954015 3010909508804593692 3008457859843441770 3004733103537296022 2799690906925900554 1307223514519974954
node.edge_media_to_caption.edges [{'node': {'text': 'Park days out #outdoors #f... [{'node': {'text': '#japon #tokyo #shinjuku #g... [{'node': {'text': 'Ph. @photographerrome #i... [{'node': {'text': 'Walking in the park in the... [{'node': {'text': '❤️ Mayawati Hathi Park In ... [{'node': {'text': '#instagram @officialviikaa... [{'node': {'text': '昇仙峡  山梨県甲府市 Nikon D7200 Le... [{'node': {'text': 'mát rượiiii #winter #expl... [{'node': {'text': '思い思いに過ごす午後の公園の雰囲気が心地よい 特にこ... [{'node': {'text': '#japon #tokyo #shinjuku #g... ... [{'node': {'text': '👽👹☠️ #instagram #picture ... [{'node': {'text': 'Minim Cove Park #minimcov... [{'node': {'text': 'The Old One 💯❤✌ . . . .... [{'node': {'text': '#city #park #🤟 #jaipur 🤟❣️... [{'node': {'text': 'Happy Weekend!!! . . . . .... [{'node': {'text': '"A smile😊 can change the w... [{'node': {'text': '💚🌴🌿🐢 • • • • • • • #spain ... [{'node': {'text': '#зима #парк #снег #winter ... [{'node': {'text': '#park #photooftheday #phot... [{'node': {'text': '#yaprak#leaf#yıldızparkı#i...
node.shortcode CnJS4ZGqRA_ CnJSx--hdQV CnJSw4rtADi CnJSsriOcS7 CnJSso4P03Z CnJSsSQLeCP CnJSpXkL9o3 CnJSoPlrUwt CnJSjeqyGjg CnJSae-hcav ... CnJP2pSyVN3 CnJP1TfSFmE CnJPtx5rMBg CnJPsxxo4Is CnJPkKXpvwf CnI5j7rJMgc CnAMHvps0Rq Cmy9NaBMpKW CbagAvkNaMK BIkMNxJADQq
node.edge_media_to_comment.count 0 1 0 0 0 1 0 1 0 0 ... 1 3 2 2 1 1 4 1 8 0
node.taken_at_timestamp 1673161680 1673161627 1673161618 1673161584 1673161583 1673161580 1673161556 1673161547 1673161508 1673161434 ... 1673160092 1673160081 1673160020 1673160011 1673159941 1673148405 1672856145 1672412120 1647969185 1470053208
node.dimensions.height 1080 1080 1349 750 1350 1080 1350 1350 636 1080 ... 1350 1349 1080 1350 1167 1080 1350 1350 607 1080
node.dimensions.width 1080 1080 1080 750 1080 1080 1080 1080 1080 1080 ... 1080 1080 1080 1080 1080 1080 1080 1080 1080 1080
node.display_url ...
node.edge_liked_by.count 0 0 0 0 3 7 6 1 1 0 ... 21 5 7 18 3 10 85 457 56 19
node.edge_media_preview_like.count 0 0 0 0 3 7 6 1 1 0 ... 21 5 7 18 3 10 85 457 56 19
node.owner.id 362451278 55382941805 4828404946 57188596306 51329646385 8571978839 6189562963 45627325252 277445402 55382941805 ... 3524100073 36970639629 46531162549 44971203472 5352019952 55827212479 8250540909 5748033415 5922530842 1273397417
node.thumbnail_src ...
node.thumbnail_resources [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... ... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram... [{'src': 'https://scontent-ber1-1.cdninstagram...
node.is_video False False False False False False False False False False ... False False False False False False False False False False
node.accessibility_caption None None None None None None None None None None ... None None None None None None None None None None

17 rows × 71 columns
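To illustrate how nested records end up as dotted names such as node.shortcode, here is a minimal sketch with pd.json_normalize. The two records are reduced, hand-written copies of the first two posts in the output above; the extra column taken_at is an addition for illustration.

```python
import pandas as pd

# Two reduced records, values copied from the output above
edges = [
    {"node": {"__typename": "GraphSidecar", "shortcode": "CnJS4ZGqRA_",
              "taken_at_timestamp": 1673161680,
              "dimensions": {"height": 1080, "width": 1080}}},
    {"node": {"__typename": "GraphImage", "shortcode": "CnJSx--hdQV",
              "taken_at_timestamp": 1673161627,
              "dimensions": {"height": 1080, "width": 1080}}},
]

# Nested dictionaries are flattened; levels are joined with "."
df_demo = pd.json_normalize(edges)
print(list(df_demo.columns))

# Convert Unix timestamps (seconds) to timezone-aware datetimes
df_demo["taken_at"] = pd.to_datetime(
    df_demo["node.taken_at_timestamp"], unit="s", utc=True)
print(df_demo[["node.shortcode", "taken_at"]])
```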

View the first few images

First, define a function.

In [17]:
from typing import List
import matplotlib.pyplot as plt

from PIL import Image, ImageFilter
from io import BytesIO

def image_grid_fromurl(url_list: List[str]):
    """Load and show images in a grid from a list of urls"""
    count = len(url_list)
    plt.figure(figsize=(11, 18))
    for ix, url in enumerate(url_list):
        r = requests.get(url=url)
        i =
        resize = (150, 150)
        i = i.resize(resize)
        i = i.filter(ImageFilter.BLUR)
        plt.subplots_adjust(bottom=0.3, right=0.8, top=0.5)
        ax = plt.subplot(3, 5, ix + 1)

Use the function to display images from "node.display_url" column.

In [18]:

Creating Maps

  • Frequently, VGI and Social Media data contains references to locations such as places or coordinates.
  • Most often, spatial references will be available as latitude and longitude (decimal degrees in WGS1984).
  • To demonstrate integration of data, we are now going to query another API to get a list of places near certain coordinates.
In [19]:
lat = 51.03711
lng = 13.76318

Get a list of nearby places using the API:

In [20]:
query_url = f''
params = {
In [21]:
response = requests.get(
    url=query_url, params=params)
if response.status_code == 200:
    print(f"Query successful. Query url: {response.url}")
Query successful. Query url:
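When a params dictionary is passed to a request, it is encoded into the query string of the URL. The sketch below shows this step with the standard library only. The base URL is a placeholder, and the parameter names follow the MediaWiki geosearch convention as an assumption; the actual values used in this notebook are not shown in this export, except for the limit of 50 mentioned further below.

```python
from urllib.parse import urlencode

lat = 51.03711
lng = 13.76318

# Placeholder endpoint (assumption, not the notebook's real URL)
base_url = "https://example.org/w/api.php"

# Assumed parameter names, following the MediaWiki geosearch API
params = {
    "action": "query",
    "list": "geosearch",
    "gscoord": f"{lat}|{lng}",  # latitude|longitude
    "gsradius": 1000,           # search radius in meters (assumed value)
    "gslimit": 50,              # maximum number of records
    "format": "json",
}

# Special characters such as "|" are percent-encoded automatically
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```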
In [22]:
json_data = json.loads(response.text)
print(json.dumps(json_data, indent=2)[0:500])
  "batchcomplete": "",
  "query": {
    "geosearch": [
        "pageid": 114705842,
        "ns": 14,
        "title": "Category:S\u00fcdwestliche Br\u00fchlvase - Zeus als Stier",
        "lat": 51.03728,
        "lon": 13.76341,
        "dist": 24.8,
        "primary": ""
        "pageid": 114703308,
        "ns": 14,
        "title": "Category:Br\u00fchl-Vasen am Palaisteich",
        "lat": 51.037298,
        "lon": 13.763389,
        "dist": 25.5,

Get List of places.

In [23]:
location_dict = json_data["query"]["geosearch"]

Turn into DataFrame.

In [24]:
df = pd.DataFrame(location_dict)
pageid ns title lat lon dist primary
0 114705842 14 Category:Südwestliche Brühlvase - Zeus als Stier 51.037280 13.763410 24.8
1 114703308 14 Category:Brühl-Vasen am Palaisteich 51.037298 13.763389 25.5
2 4712421 14 Category:Großer Garten, Dresden 51.037500 13.763056 44.2
3 38409391 14 Category:Maps of Großer Garten, Dresden 51.037500 13.763056 44.2
4 4312703 14 Category:Palais im Großen Garten 51.037800 13.762800 81.2
In [25]:
(50, 7)

If we have queried 50 records, we have reached the limit specified in our query. There are likely more records available, which would need to be retrieved using subsequent queries (e.g. by grid/bounding box). However, for the workshop, 50 locations are enough.

Modify data: Remove "Category:" from the title column.

  • Functions can be easily applied to subsets of records in DataFrames.
  • Although it is tempting, do not iterate through records.
  • DataFrame vector functions are almost always faster and more pythonic.
In [26]:
df["title"] = df["title"].str.replace("Category:", "")
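For comparison, this sketch contrasts the vectorized call with an explicit loop on two titles taken from the table above. Both give the same result; the loop is shown only to make the difference visible and is not recommended for larger data.

```python
import pandas as pd

df_demo = pd.DataFrame({"title": [
    "Category:Großer Garten, Dresden",
    "Category:Palais im Großen Garten",
]})

# Vectorized: one call operates on the whole column
vectorized = df_demo["title"].str.replace("Category:", "", regex=False)

# Loop: same result, but slower and more verbose
looped = []
for _, row in df_demo.iterrows():
    looped.append(row["title"].replace("Category:", ""))

print(vectorized.tolist())
```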

Turn DataFrame into a GeoDataFrame

GeoDataframe ? A geopandas GeoDataFrame is the spatial equivalent of a pandas DataFrame. It supports all operations of DataFrames, plus spatial operations. A GeoDataFrame can be compared to a Shapefile in, e.g., QGIS.
In [27]:
import geopandas as gp
gdf = gp.GeoDataFrame(
    df, geometry=gp.points_from_xy(df.lon,

Set projection, reproject

Projections in Python
  • Most available spatial packages have more or less agreed on a standard format for handling projections in python.
  • The recommended way is to define projections using their epsg ids, which can be found using
  • Note that, sometimes, the projection-string refers to other providers, e.g. for Mollweide, it is "ESRI:54009"
In [28]:
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project
In [29]:
pageid ns name lat lon dist primary geometry
0 114705842 14 Südwestliche Brühlvase - Zeus als Stier 51.037280 13.763410 24.8 POINT (1532135.793 6627890.774)
1 114703308 14 Brühl-Vasen am Palaisteich 51.037298 13.763389 25.5 POINT (1532133.455 6627893.961)
2 4712421 14 Großer Garten, Dresden 51.037500 13.763056 44.2 POINT (1532096.336 6627929.721)
3 38409391 14 Maps of Großer Garten, Dresden 51.037500 13.763056 44.2 POINT (1532096.336 6627929.721)
4 4312703 14 Palais im Großen Garten 51.037800 13.762800 81.2 POINT (1532067.888 6627982.831)
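To see what the reprojection does numerically, the projected coordinates of the first record can be reproduced with the spherical Web Mercator formulas. This is a sketch for understanding only, assuming the standard sphere radius of 6378137 m; in practice, to_crs() handles this and many other projections.

```python
import math

EARTH_RADIUS = 6378137.0  # meters, sphere radius used by Web Mercator

def wgs84_to_web_mercator(lon: float, lat: float) -> tuple:
    """Project longitude/latitude (decimal degrees) to Web Mercator meters."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(
        math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# First record of the table above (lon, lat)
x, y = wgs84_to_web_mercator(13.763410, 51.037280)
print(round(x, 3), round(y, 3))
```

The printed values can be compared with the POINT geometry of record 0 in the table above.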

Display location on a map

  • Matplotlib and contextily provide one way to plot static maps.
  • We're going to show another, interactive map renderer afterwards.

Import contextily, which provides static background tiles for the matplotlib renderer.

In [30]:
import contextily as cx

1. Create a bounding box for the map

In [31]:
x = gdf.loc[0].geometry.x
y = gdf.loc[0].geometry.y

margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
gdf.loc[0] ?
  • gdf.loc[0] uses the loc indexer from pandas. It accesses the record with index label 0, which here is the first record of the (Geo)DataFrame.
  • .geometry.x accesses the (projected) x coordinate of the point geometry. This is only available for a GeoDataFrame (geopandas).
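The bounding box calculation can be sketched as a reusable function. The function name bbox_around is hypothetical; the coordinates are the projected values of the first record shown earlier in this notebook.

```python
def bbox_around(x: float, y: float, margin: float) -> tuple:
    """Return (bottom-left, top-right) corners of a square around a point."""
    bottomleft = (x - margin, y - margin)
    topright = (x + margin, y + margin)
    return bottomleft, topright

# Projected coordinates (meters) of the first record
bottomleft, topright = bbox_around(1532135.793, 6627890.774, margin=1000)

# The box spans twice the margin in each direction
width = topright[0] - bottomleft[0]
height = topright[1] - bottomleft[1]
print(bottomleft, topright, width, height)
```

Because the data is projected to a metric coordinate system, the margin can be given in meters, keeping in mind that Web Mercator distances are stretched at this latitude.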

2. Create point layer, annotate and plot.

  • With matplotlib, it is possible to adjust almost every pixel individually.
  • However, the more fine-tuning is needed, the more complex the plotting code will get.
  • In this case, it is better to define methods and functions, to structure and reuse code.
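As one example of structuring code into functions, the alternating label offsets used in the plotting cell could be moved into a small helper. The function name label_offset is hypothetical; the offset values are the ones used in the plotting code of this notebook.

```python
def label_offset(index: int) -> tuple:
    """Return (x, y) label offsets in points, alternating by record index."""
    # even records go to the left, odd records to the right
    offset_x = -100 if index % 2 == 0 else 30
    # every fourth record is moved up, all others down
    offset_y = 100 if index % 4 == 0 else -30
    return offset_x, offset_y

for index in range(4):
    print(index, label_offset(index))
```

Alternating the offsets reduces overlap between labels of places that are close together.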
In [32]:
from matplotlib.patches import ArrowStyle
# create the point-layer
ax = gdf.plot(
    figsize=(10, 15),
# set display x and y limit
    bbox_bottomleft[0], bbox_topright[0])
    bbox_bottomleft[1], bbox_topright[1])
# turn off axes display
# add callouts 
# for the name of the places
for index, row in gdf.iterrows():
    # offset labels by odd/even
    label_offset_x = 30
    if (index % 2) == 0:
        label_offset_x = -100
    label_offset_y = -30
    if (index % 4) == 0:
        label_offset_y = 100
        text=row["name"].replace(' ', '\n'),
        xy=(row["geometry"].x, row["geometry"].y),
        xytext=(label_offset_x, label_offset_y),
        textcoords="offset points",
                "simple, head_length=2, head_width=2, tail_width=.2"), 
    ax, alpha=0.5,