Alexander Dunkel, Institute of Cartography, TU Dresden
In this notebook, the Reddit API is used to query posts and comments for selected subreddits (National Parks).
Load dependencies:
import os
from pathlib import Path
import pandas as pd
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML, Markdown
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Define initial parameters that affect processing
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
WORK_DIR.mkdir(exist_ok=True)
OUTPUT.mkdir(exist_ok=True)
We use praw, the Python Reddit API Wrapper. Have a look at the Reddit API Rules: Reddit allows 60 requests per minute, and requests for multiple resources at a time are always better than requests for single resources in a loop.
Further limits were recently introduced to the Reddit API, which restrict us to the most recent 1000 submissions per subreddit.
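To illustrate the batching advice, the sketch below fetches several items with a single call to praw's reddit.info() instead of one request per item. This is only an illustration: it assumes the authenticated reddit instance that is created further below, and the fullnames are placeholders.
# Batched lookup: one API request for several items instead of one request each.
# "t3_" fullnames denote submissions; the ids below are placeholders.
fullnames = ["t3_abc123", "t3_def456", "t3_ghi789"]
for item in reddit.info(fullnames=fullnames):
    print(item.id, getattr(item, "title", ""))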
We'll first prepare the environment using a conda --prefix in Carto-Lab Docker, for persistence.
%%bash
DIR="/envs/praw/"
if [ ! -d "$DIR" ]; then
echo "Installing environment in ${DIR}..."
conda create \
--prefix "$DIR" \
--channel conda-forge \
python=3.9 pip praw ipykernel python-dotenv \
--yes > /dev/null 2>&1
else
echo "Environment already exists."
fi
Install kernelspec to jupyter.
%%bash
if [ ! -d "/root/.local/share/jupyter/kernels/praw_env" ]; then
echo "Linking environment to jupyter"
/envs/praw/bin/python -m ipykernel install --user --name=praw_env
fi
Hit CTRL+F5 and select praw_env in the kernel selector in the top-right corner of JupyterLab.
Check the Authenticating via OAuth praw docs. Store your Reddit credentials in a .env file. Add this to your docker-compose.yml:
version: '3.6'
services:
  jupyterlab:
    ports:
      - 127.0.0.1:${REDDIT_TOKEN_WEBPORT:-8063}:8063
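For orientation, the .env file read below via python-dotenv might look like the following sketch; the variable names match the os.getenv() calls further down, and all values are placeholders.
# .env -- placeholder values only
CLIENT_ID=your_reddit_app_client_id
CLIENT_SECRET=your_reddit_app_client_secret
USER_AGENT=python:my-reddit-project:v0.1 (by u/your_username)
# REFRESH_TOKEN is added after running py/modules/obtain_refresh_token.py
REFRESH_TOKEN=
REDDIT_TOKEN_WEBPORT=8063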
Since we are running in Carto-Lab Docker, we want to connect from outside the container to the script in py/modules/obtain_refresh_token.py (source), which listens on the Docker-internal localhost:8063.
If you're working with JupyterLab on a remote computer, you need to add an SSH tunnel, e.g. ssh user@123.45.67.8 -L :8063:127.0.0.1:8063 -p 22 -N -v
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
%%bash
# %%bash does not expand Python expressions, so the project root is
# referenced relative to the notebook's working directory instead.
if [ -z "$REFRESH_TOKEN" ]; then
    /envs/praw/bin/python ../py/modules/obtain_refresh_token.py
fi
CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
USER_AGENT = os.getenv("USER_AGENT")
REFRESH_TOKEN = os.getenv("REFRESH_TOKEN")
import praw
reddit = praw.Reddit(
client_id=CLIENT_ID,
client_secret=CLIENT_SECRET,
user_agent=USER_AGENT,
refresh_token=REFRESH_TOKEN
)
print(reddit.read_only)
sub_yosemite = reddit.subreddit("yosemite")
for submission in reddit.subreddit("test").hot(limit=10):
print(submission.title)
for submission in sub_yosemite.new():
print(f'{submission.id}: {submission.title}')
Format submission and get comments for a sample submission id:
# Fetch the sample submission once, to avoid repeated listing requests.
sample_submission = list(sub_yosemite.new())[2]
display(Markdown(
    f'<div style="width:500px"> \n\n**Original submission**:\n> {sample_submission.selftext.replace(f"{os.linesep}{os.linesep}", f"{os.linesep}{os.linesep}>")} \n\n</div>'))
for ix, top_level_comment in enumerate(sample_submission.comments):
    display(Markdown(f'<div style="width:500px"> \n\n**Comment #{ix:02}**:\n>> {top_level_comment.body} \n\n</div>'))
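The loop above only walks the top-level comments. To retrieve the full comment tree, praw's CommentForest offers replace_more() and list(); a minimal sketch for the same sample submission could look like this:
# Fetch the full comment tree: resolve "MoreComments" placeholders first,
# then flatten the forest (top-level comments and all replies).
sample_submission.comments.replace_more(limit=0)
all_comments = sample_submission.comments.list()
print(f'{len(all_comments)} comments (including replies)')
for comment in all_comments[:5]:
    print(f'- {comment.body[:80]}')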
First, get the maximum number of posts:
all_submissions = list(sub_yosemite.new(limit=1000))
print(f'{len(all_submissions)}')
There's an API query limit of 1000. If your subreddit has more than 1000 submissions, you need to find another way to retrieve the entirety of posts/comments.
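One partial workaround, sketched below, is to combine several listing endpoints (new, top, hot, controversial) and deduplicate by submission id. Each listing is itself capped at roughly 1000 items, so this still does not guarantee full coverage.
# Combine several listings and deduplicate by id (partial workaround only).
unique_submissions = {}
for listing in (sub_yosemite.new(limit=None), sub_yosemite.top(limit=None),
                sub_yosemite.hot(limit=None), sub_yosemite.controversial(limit=None)):
    for submission in listing:
        unique_submissions[submission.id] = submission
print(f'{len(unique_submissions)} unique submissions')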
Have a look at the available attributes:
import pprint
pprint.pprint(vars(all_submissions[0]))
See the different available submission attributes in the PRAW api docs.
We are going to write this to a JSON file first.
Notes: permalink & name of a submission are only captured with the url field if it is a selfpost (for link posts, url points to the external content), so these need to be queried, too.
import json
list_of_items = []
submission_fields = (
'id', 'created_utc', 'author_flair_text', 'author', 'is_original_content', 'is_self',
'link_flair_text', 'name', 'num_comments', 'permalink', 'media', 'over_18', 'score',
'selftext', 'title', 'total_awards_received', 'url', 'view_count')
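To make the note above concrete, the following quick check (an illustrative sketch, not part of the processing itself) prints url and permalink for a few submissions; for selfposts, url points back to the submission itself, for link posts it points to external content.
# Compare url and permalink for a few submissions from the list above.
for submission in all_submissions[:5]:
    kind = "selfpost" if submission.is_self else "link post"
    print(f'{submission.name} ({kind}):\n  url:       {submission.url}\n  permalink: {submission.permalink}')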
Turn the selected fields into a dictionary and attach the values from the Yosemite submissions list. The author field needs to be cast to str in order to be JSON serializable.
for submission in all_submissions:
to_dict = vars(submission)
sub_dict = {field:str(to_dict[field]) if field == 'author' else to_dict[field] for field in submission_fields}
list_of_items.append(sub_dict)
print(json.dumps(list_of_items[:3], indent=2))
Write to file
with open(OUTPUT / 'yosemite_submissions.json', 'w') as f:
json.dump(list_of_items, f)
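As a quick sanity check, the exported file can be loaded back into a pandas DataFrame (pandas was already imported above); this is an optional verification step, not part of the original workflow.
# Optional check: load the exported JSON back into a DataFrame for inspection.
df = pd.read_json(OUTPUT / 'yosemite_submissions.json')
print(df.shape)
df.head()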
Print the timestamp of the oldest retrieved submission in the dataset:
from datetime import datetime
datetime.fromtimestamp(all_submissions[-1].created_utc).strftime('%b-%d-%Y')
This means that it is not possible to get all posts for this subreddit using the Reddit API, since we are limited to the newest 1000 posts. An alternative is to use the pushshift.io API.
We continue in the following notebook pmaw.html.
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./notebook.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-