Alexander Dunkel, Institute of Cartography, TU Dresden
In this notebook, the Reddit API is used to query posts and comments for selected subreddits (National Parks).
Load dependencies:
import os
from pathlib import Path
import pandas as pd
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML, Markdown
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Define initial parameters that affect processing
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
WORK_DIR.mkdir(exist_ok=True)
OUTPUT.mkdir(exist_ok=True)
We use praw, the Python Reddit API Wrapper. Have a look at the Reddit API Rules: Reddit allows 60 requests per minute, and requests for multiple resources at a time are always better than requests for single resources in a loop.
Further limits were recently introduced to the Reddit API, which restrict us to the most recent 1000 submissions per subreddit.
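To illustrate the batching advice, the sketch below fetches several items with a single call to praw's reddit.info() instead of one request per item. This is only an illustration: it assumes the authenticated reddit instance that is created further below, and the fullnames are placeholders.
# Batched lookup: one API request for several items instead of one request each.
# "t3_" fullnames denote submissions; the ids below are placeholders.
fullnames = ["t3_abc123", "t3_def456", "t3_ghi789"]
for item in reddit.info(fullnames=fullnames):
    print(item.id, getattr(item, "title", ""))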
We'll first prepare the environment using a conda --prefix in Carto-Lab Docker, for persistence.
%%bash
DIR="/envs/praw/"
if [ ! -d "$DIR" ]; then
echo "Installing environment in ${DIR}..."
conda create \
--prefix "$DIR" \
--channel conda-forge \
python=3.9 pip praw ipykernel python-dotenv \
--yes > /dev/null 2>&1
else
echo "Environment already exists."
fi
Install kernelspec to jupyter.
%%bash
if [ ! -d "/root/.local/share/jupyter/kernels/praw_env" ]; then
echo "Linking environment to jupyter"
/envs/praw/bin/python -m ipykernel install --user --name=praw_env
fi
Hit CTRL+F5 and select praw_env in the kernel selector in the top-right corner of JupyterLab.
Check the Authenticating via OAuth praw docs. Store your Reddit credentials in a .env file. Add this to your docker-compose.yml:
version: '3.6'
services:
  jupyterlab:
    ports:
      - 127.0.0.1:${REDDIT_TOKEN_WEBPORT:-8063}:8063
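For orientation, the .env file read below via python-dotenv might look like the following sketch; the variable names match the os.getenv() calls further down, and all values are placeholders.
# .env -- placeholder values only
CLIENT_ID=your_reddit_app_client_id
CLIENT_SECRET=your_reddit_app_client_secret
USER_AGENT=python:my-reddit-project:v0.1 (by u/your_username)
# REFRESH_TOKEN is added after running py/modules/obtain_refresh_token.py
REFRESH_TOKEN=
REDDIT_TOKEN_WEBPORT=8063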
Since we are running in Carto-Lab Docker, we want to connect from outside the container to the script in py/modules/obtain_refresh_token.py (source), which listens on the Docker-internal localhost:8063.
If you're working with JupyterLab on a remote computer, you need to add an SSH tunnel, e.g. ssh user@123.45.67.8 -L :8063:127.0.0.1:8063 -p 22 -N -v
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
%%bash
# %%bash does not expand Python expressions, so the project root is
# referenced relative to the notebook's working directory instead.
if [ -z "$REFRESH_TOKEN" ]; then
    /envs/praw/bin/python ../py/modules/obtain_refresh_token.py
fi
CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
USER_AGENT = os.getenv("USER_AGENT")
REFRESH_TOKEN = os.getenv("REFRESH_TOKEN")
import praw
reddit = praw.Reddit(
client_id=CLIENT_ID,
client_secret=CLIENT_SECRET,
user_agent=USER_AGENT,
refresh_token=REFRESH_TOKEN
)
print(reddit.read_only)
sub_yosemite = reddit.subreddit("yosemite")
for submission in reddit.subreddit("test").hot(limit=10):
print(submission.title)
for submission in sub_yosemite.new():
print(f'{submission.id}: {submission.title}')
Format submission and get comments for a sample submission id:
# Fetch the sample submission once, to avoid repeated listing requests.
sample_submission = list(sub_yosemite.new())[2]
display(Markdown(
    f'<div style="width:500px"> \n\n**Original submission**:\n> {sample_submission.selftext.replace(f"{os.linesep}{os.linesep}", f"{os.linesep}{os.linesep}>")} \n\n</div>'))
for ix, top_level_comment in enumerate(sample_submission.comments):
    display(Markdown(f'<div style="width:500px"> \n\n**Comment #{ix:02}**:\n>> {top_level_comment.body} \n\n</div>'))
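The loop above only walks the top-level comments. To retrieve the full comment tree, praw's CommentForest offers replace_more() and list(); a minimal sketch for the same sample submission could look like this:
# Fetch the full comment tree: resolve "MoreComments" placeholders first,
# then flatten the forest (top-level comments and all replies).
sample_submission.comments.replace_more(limit=0)
all_comments = sample_submission.comments.list()
print(f'{len(all_comments)} comments (including replies)')
for comment in all_comments[:5]:
    print(f'- {comment.body[:80]}')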
First, get the maximum number of posts:
all_submissions = list(sub_yosemite.new(limit=1000))
print(f'{len(all_submissions)}')
There's an API query limit of 1000. If your subreddit has more than 1000 submissions, you need to find another way to retrieve the entirety of posts/comments.
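One partial workaround, sketched below, is to combine several listing endpoints (new, top, hot, controversial) and deduplicate by submission id. Each listing is itself capped at roughly 1000 items, so this still does not guarantee full coverage.
# Combine several listings and deduplicate by id (partial workaround only).
unique_submissions = {}
for listing in (sub_yosemite.new(limit=None), sub_yosemite.top(limit=None),
                sub_yosemite.hot(limit=None), sub_yosemite.controversial(limit=None)):
    for submission in listing:
        unique_submissions[submission.id] = submission
print(f'{len(unique_submissions)} unique submissions')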
Have a look at the available attributes:
import pprint
pprint.pprint(vars(all_submissions[0]))
See the different available submission attributes in the PRAW api docs.
We are going to write this to a JSON file first.
Notes: permalink & name of a submission are only captured with the url field if it is a selfpost (for link posts, url points to the external content), so these need to be queried, too.
import json
list_of_items = []
submission_fields = (
'id', 'created_utc', 'author_flair_text', 'author', 'is_original_content', 'is_self',
'link_flair_text', 'name', 'num_comments', 'permalink', 'media', 'over_18', 'score',
'selftext', 'title', 'total_awards_received', 'url', 'view_count')
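To make the note above concrete, the following quick check (an illustrative sketch, not part of the processing itself) prints url and permalink for a few submissions; for selfposts, url points back to the submission itself, for link posts it points to external content.
# Compare url and permalink for a few submissions from the list above.
for submission in all_submissions[:5]:
    kind = "selfpost" if submission.is_self else "link post"
    print(f'{submission.name} ({kind}):\n  url:       {submission.url}\n  permalink: {submission.permalink}')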
Turn the selected fields into a dictionary and attach the values from the Yosemite submissions list. The author field needs to be cast to str in order to be JSON serializable.
for submission in all_submissions:
to_dict = vars(submission)
sub_dict = {field:str(to_dict[field]) if field == 'author' else to_dict[field] for field in submission_fields}
list_of_items.append(sub_dict)
print(json.dumps(list_of_items[:3], indent=2))
Write to file
with open(OUTPUT / 'yosemite_submissions.json', 'w') as f:
json.dump(list_of_items, f)
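As a quick sanity check, the exported file can be loaded back into a pandas DataFrame (pandas was already imported above); this is an optional verification step, not part of the original workflow.
# Optional check: load the exported JSON back into a DataFrame for inspection.
df = pd.read_json(OUTPUT / 'yosemite_submissions.json')
print(df.shape)
df.head()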
Print the timestamp of the oldest retrieved submission in the dataset:
from datetime import datetime
datetime.fromtimestamp(all_submissions[-1].created_utc).strftime('%b-%d-%Y')
This means that it is not possible to get all posts for this subreddit using the Reddit API, since we are limited to the newest 1000 posts. An alternative is to use the pushshift.io API.
We continue in the following notebook pmaw.html.
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./notebook.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-