Reddit query using pmaw + praw

Alexander Dunkel, Institute of Cartography, TU Dresden

Out[80]:

Last updated: Apr-26-2023, Carto-Lab Docker Version 0.13.0

This is part 2 of the Reddit API notebooks. In the first notebook (notebook.html), we used the official Reddit API. Since the official API is limited to the latest 1000 entries, we use the Pushshift.io API here (through pmaw) to retrieve all posts for a given subreddit.

Prepare environment

We import the dependencies required to run this notebook in the cells below.

In [1]:
import os
import time
import sys
import logging
import json
import calendar
import pprint
import datetime as dt
from pathlib import Path
from typing import List, Tuple, Dict, Optional, Set
from IPython.display import clear_output, display, HTML, Markdown
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base.tools import display_file

Activate autoreload of changed python files:

In [2]:
%load_ext autoreload
%autoreload 2

Parameters

Define initial parameters that affect processing

In [3]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
In [4]:
WORK_DIR.mkdir(exist_ok=True)
OUTPUT.mkdir(exist_ok=True)

Environment setup

Install pmaw with pip into the existing praw environment and link that environment to Jupyter as a kernel.

In [5]:
%%bash
if [ ! -d "/envs/praw/lib/python3.9/site-packages/pmaw" ]; then
    /envs/praw/bin/python -m pip install pmaw > /dev/null 2>&1
else
  echo "Already installed."
fi
# link
if [ ! -d "/root/.local/share/jupyter/kernels/praw_env" ]; then
    echo "Linking environment to jupyter"
    /envs/praw/bin/python -m ipykernel install --user --name=praw_env
else
  echo "Already linked."
fi
Already installed.
Already linked.
List of package versions used in this notebook

package   python   pmaw    praw
version   3.9.16   3.0.0   7.7.0

Pushshift API

In [7]:
from pmaw import PushshiftAPI
API = PushshiftAPI()

Setting API.praw to a praw.Reddit instance enriches the items retrieved from Pushshift with metadata fetched directly from Reddit:

In [8]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[8]:
True
In [9]:
CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
USER_AGENT = os.getenv("USER_AGENT")
REFRESH_TOKEN = os.getenv("REFRESH_TOKEN")
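Since os.getenv returns None for unset variables, a missing entry in the .env file only surfaces later, when praw tries to authenticate. A minimal sketch of an early check follows; the helper name missing_credentials and the REQUIRED_KEYS tuple are hypothetical and not part of the notebook, only the four variable names are taken from the cell above.

```python
import os
from typing import List, Mapping, Sequence

# Variable names as read in the cell above; the tuple itself is illustrative.
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT", "REFRESH_TOKEN")


def missing_credentials(
        env: Mapping[str, str] = os.environ,
        required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are unset or empty in env."""
    return [key for key in required if not env.get(key)]


# Demonstration with a plain dict instead of the real environment
example_env = {"CLIENT_ID": "abc", "USER_AGENT": "demo"}
missing = missing_credentials(example_env)
print(missing)
```

Called without arguments, the function inspects the real process environment, so it could be run right after load_dotenv.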
In [10]:
import praw
REDDIT = praw.Reddit(
    client_id=CLIENT_ID, 
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT,
    refresh_token=REFRESH_TOKEN
)
In [11]:
API.praw = REDDIT

Search submissions

In [12]:
before = int(dt.datetime(2021,4,11,0,0).timestamp())
after = int(dt.datetime(2021,1,1,0,0).timestamp())
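Note that calling timestamp() on a naive datetime interprets it in the local time zone of the machine, so the two bounds above shift by the local UTC offset. If UTC boundaries are wanted, a time zone can be attached explicitly. The sketch below assumes this is the desired behavior; to_epoch is a hypothetical helper, not part of the notebook.

```python
import datetime as dt


def to_epoch(year: int, month: int, day: int,
             tz: dt.tzinfo = dt.timezone.utc) -> int:
    """Return the Unix timestamp (whole seconds) for midnight of the date in tz."""
    return int(dt.datetime(year, month, day, tzinfo=tz).timestamp())


# Same dates as above, but pinned to UTC instead of the local time zone
after_utc = to_epoch(2021, 1, 1)
before_utc = to_epoch(2021, 4, 11)
print(after_utc, before_utc)
```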

We will query the following subreddits (based on Wikipedia's list, sorted by visitors descending):

Click
  • r/yosemite
  • r/yellowstone
  • r/GlacierNationalPark
  • r/ZionNP, r/ZionNationalPark/
  • r/GSMNP (Great Smoky Mountains National Park)
  • r/grandcanyon, r/GrandCanyonNP, r/Grandcanyonhiking
  • r/RockyMountain
  • r/acadianationalpark
  • r/GrandTetonNatlPark, r/GrandTeton
  • r/IndianaDunes
  • r/JoshuaTree
  • r/OlympicNationalPark
  • (r/CuyahogaFalls)
  • (r/HotSprings)
  • r/BryceCanyon
  • r/archesnationalpark, r/arches
  • r/NewRiverGorgeNP
  • r/Mount_Rainier
  • r/shenandoah, r/ShenandoahPark
  • r/CapitolReefNP, r/capitolreef
  • r/DeathValleyNP, r/deathvalley
  • r/Sequoia, r/KingsCanyon
  • r/Everglades
  • r/Canyonlands
  • r/haleakala
  • r/CraterLake
  • r/PetrifiedForest
  • r/BigBendNationalPark (deprecated), r/BigBend, r/BigBendTX
  • r/MammothCave
  • r/Redwoodnationalpark
  • r/KenaiPeninsula
  • r/lassenvolcanic
  • r/CarlsbadCavernsNP
  • r/PinnaclesNP, r/Pinnacles
  • r/virginislands
  • r/GreatBasinStories
  • r/glacier
  • r/isleroyale
  • r/northcascades
  • r/AmericanSamoa

* Crossed-out entries have been excluded due to too little content (fewer than 50 submissions)
** Entries in parentheses are alternative subreddits, or subreddits referring to popular landscape features within National Parks

In [13]:
SUBREDDIT = "RockyMountain"

Get created date and number of members using praw (the original Reddit API):

In [16]:
sub = REDDIT.subreddit(SUBREDDIT)
In [17]:
print(f'Created: {dt.datetime.fromtimestamp(sub.created_utc):%Y-%m-%d}')
print(f'Subscribers ("Members"): {sub.subscribers}')
Created: 2015-06-15
Subscribers ("Members"): 418
In [15]:
(OUTPUT / SUBREDDIT).mkdir(exist_ok=True)
In [18]:
%%time
all_submissions = list(API.search_submissions(subreddit=SUBREDDIT, until=before, since=after))
print(f'{len(all_submissions)}')
4
CPU times: user 275 ms, sys: 27.3 ms, total: 302 ms
Wall time: 2min 55s
Have a look at the available attributes of the submissions object.

{'_comments_by_id': {},
'_fetched': False,
'_reddit': ,
'all_awardings': [],
'allow_live_comments': False,
'approved_at_utc': None,
'approved_by': None,
'archived': False,
'author': Redditor(name='wynnanderson'),
'author_flair_background_color': None,
'author_flair_css_class': None,
'author_flair_richtext': [],
'author_flair_template_id': None,
'author_flair_text': None,
'author_flair_text_color': None,
'author_flair_type': 'text',
'author_fullname': 't2_5l7et8w8',
'author_is_blocked': False,
'author_patreon_flair': False,
'author_premium': False,
'awarders': [],
'banned_at_utc': None,
'banned_by': None,
'can_gild': True,
'can_mod_post': False,
'category': None,
'clicked': False,
'comment_limit': 2048,
'comment_sort': 'confidence',
'content_categories': None,
'contest_mode': False,
'created': 1610463317.0,
'created_utc': 1610463317.0,
'discussion_type': None,
'distinguished': None,
'domain': 'i.redd.it',
'downs': 0,
'edited': False,
'gilded': 0,
'gildings': {},
'hidden': False,
'hide_score': False,
'id': 'kvt53a',
'is_created_from_ads_ui': False,
'is_crosspostable': True,
'is_meta': False,
'is_original_content': False,
'is_reddit_media_domain': True,
'is_robot_indexable': True,
'is_self': False,
'is_video': False,
'likes': None,
'link_flair_background_color': '',
'link_flair_css_class': None,
'link_flair_richtext': [],
'link_flair_text': None,
'link_flair_text_color': 'dark',
'link_flair_type': 'text',
'locked': False,
'media': None,
'media_embed': {},
'media_only': False,
'mod_note': None,
'mod_reason_by': None,
'mod_reason_title': None,
'mod_reports': [],
'name': 't3_kvt53a',
'no_follow': False,
'num_comments': 0,
'num_crossposts': 0,
'num_reports': None,
'over_18': False,
'parent_whitelist_status': None,
'permalink': '/r/RockyMountain/comments/kvt53a/anyone_know_what_this_is/',
'pinned': False,
'post_hint': 'image',
'preview': {'enabled': True,
'images': [{'id': '6OUXEnOWiQphSUHK2lTHfn0R8uqhVXTRswvgLbETn38',
'resolutions': [{'height': 144,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=108&crop=smart&auto=webp&v=enabled&s=c36482943b8bb3045dec40a4b6599922ae58b71a',
'width': 108},
{'height': 288,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=216&crop=smart&auto=webp&v=enabled&s=35d574419d0acaf80b1f5dc73ad5025cb734499e',
'width': 216},
{'height': 426,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=320&crop=smart&auto=webp&v=enabled&s=22c47bebe0f8ef90aeb3dc4411c31da55ed283d8',
'width': 320},
{'height': 853,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=640&crop=smart&auto=webp&v=enabled&s=fe1643285c78856fb903b7457c48d1e20093b71a',
'width': 640},
{'height': 1280,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=960&crop=smart&auto=webp&v=enabled&s=121551048f810aa156685800aa7d52dd1a710e11',
'width': 960},
{'height': 1440,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?width=1080&crop=smart&auto=webp&v=enabled&s=6f942e3f78bbacde522b0ff0af3f17941f11d85b',
'width': 1080}],
'source': {'height': 3088,
'url': 'https://preview.redd.it/uvhwuazq0xa61.jpg?auto=webp&v=enabled&s=66e6551b4042712458c15eec97ed5fca02e94e40',
'width': 2316},
'variants': {}}]},
'pwls': None,
'quarantine': False,
'removal_reason': None,
'removed_by': None,
'removed_by_category': None,
'report_reasons': None,
'saved': False,
'score': 2,
'secure_media': None,
'secure_media_embed': {},
'selftext': '',
'selftext_html': None,
'send_replies': True,
'spoiler': False,
'stickied': False,
'subreddit': Subreddit(display_name='RockyMountain'),
'subreddit_id': 't5_38om7',
'subreddit_name_prefixed': 'r/RockyMountain',
'subreddit_subscribers': 418,
'subreddit_type': 'public',
'suggested_sort': None,
'thumbnail': 'https://b.thumbs.redditmedia.com/Z6JoV5W0p2ZEQY_0byNlckcMjwHEDF8plEWrTKmIbTc.jpg',
'thumbnail_height': 140,
'thumbnail_width': 140,
'title': 'Anyone know what this is?',
'top_awarded_type': None,
'total_awards_received': 0,
'treatment_tags': [],
'ups': 2,
'upvote_ratio': 1.0,
'url': 'https://i.redd.it/uvhwuazq0xa61.jpg',
'url_overridden_by_dest': 'https://i.redd.it/uvhwuazq0xa61.jpg',
'user_reports': [],
'view_count': None,
'visited': False,
'whitelist_status': None,
'wls': None}

Store data to json

Select fields to store below.

In [39]:
SUBMISSION_FIELDS = (
    'id', 'created_utc', 'author_flair_text', 'author', 'is_original_content', 'is_self', 
    'link_flair_text', 'name', 'num_comments', 'permalink', 'media', 'over_18', 'score', 
    'selftext', 'title', 'total_awards_received', 'url', 'view_count', 'thumbnail')
In [40]:
def get_new_filename(name: Path) -> Path:
    counter = 0
    while Path(name.parent / f'{name.stem}_{counter:02}{name.suffix}').exists():
        counter += 1
    return Path(name.parent / f'{name.stem}_{counter:02}{name.suffix}')
    
def write_json(
        items: List[Dict[str, str]], name: str, output: Path = OUTPUT, 
        submission_fields: Tuple[str, ...] = SUBMISSION_FIELDS, subreddit: str = SUBREDDIT) -> None:
    """Filter attributes and write list of dictionaries as json dump to disk"""
    list_of_items = []
    for submission in items:
        sub_dict = {
            field:str(submission[field]) if field == 'author' \
                else submission[field] for field in submission_fields}
        sub_dict['subreddit'] = subreddit
        list_of_items.append(sub_dict)
    filename = output / subreddit / name
    if filename.exists():
        filename = get_new_filename(filename)
    with open(filename, 'w') as f:
        json.dump(list_of_items, f)
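The field filtering and the JSON round trip can be tried in isolation, without querying any API. The sketch below is a variation on write_json, not a copy: select_fields is a hypothetical helper, and it uses dict.get so that a field missing from an item becomes None instead of raising a KeyError (an assumption about the desired behavior). The sample values are taken from the submission printed above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def select_fields(item: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested keys; absent keys become None."""
    selected = {field: item.get(field) for field in fields}
    if selected.get('author') is not None:
        # Redditor objects are not JSON serializable; store the name as text
        selected['author'] = str(selected['author'])
    return selected


sample = {'id': 'kvt53a', 'title': 'Anyone know what this is?', 'score': 2, 'ups': 2}
record = select_fields(sample, ('id', 'title', 'score', 'view_count'))

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.json'
    path.write_text(json.dumps([record]))
    restored = json.loads(path.read_text())
print(restored)
```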

Loop through all months

Let's define a few functions to automate looping through months and storing data to json on disk.

We first need a few functions to catch the Pushshift API RuntimeError that occurs when not all Pushshift shards are active. In that case, we wait for some time and query again, or ask the user how to proceed.

In [25]:
class WaitingException(Exception):
    pass

def get_submissions(
        subreddit: str, before: int, after: int, api: PushshiftAPI) -> List[Dict[str, str]]:
    """Retrieve submissions from PushshiftAPI, given a subreddit and a before and after date"""
    return list(api.search_submissions(subreddit=subreddit, until=before, since=after))
    
def query(
        subreddit: str, before: int, after: int,
        api: PushshiftAPI, num_retries: int = 3) -> List[Dict[str, str]]:
    """Query pushshift API until valid query returned, 
    wait if RuntimeError occurs (not all shards active)
    """
    for ix in range(num_retries):
        try:
            return get_submissions(
                subreddit=subreddit, before=before, after=after, api=api)
        except RuntimeError:
            if ix < (num_retries - 1):
                print(
                    f"Waiting for all PushShift shards "
                    f"to become active again (try {ix+1} of {num_retries})")
                if ix == 0:
                    print("Sleeping for 10 Minutes")
                    time.sleep(600)
                elif ix == 1:
                    print("Sleeping for 2 Hours")
                    time.sleep(7200)
                elif ix == 2:
                    print("Sleeping for 24 Hours")
                    time.sleep(86400)
            else:
                raise WaitingException()

def query_pushshift(
        subreddit: str, before: int, after: int,
        api: PushshiftAPI, num_retries: int = 3) -> Optional[List[Dict[str, str]]]:
    """Query pushshift API until valid query returned or user explicitly cancels query.
    Returns None if the user cancels.
    """
    while True:
        # repeat until a valid result is returned or the user cancels
        try:
            return query(
                subreddit=subreddit, before=before, after=after,
                api=api, num_retries=num_retries)
        except WaitingException:
            print(
                f"Waited {num_retries} times and still "
                f"not all PushShift shards are active.")
            while True:
                # repeat until the user enters yes/no
                pick = input("Continue waiting? (y/n)").lower()
                if pick in ('yes', 'y'):
                    print("Repeating query...")
                    break  # leave the prompt loop and query again
                elif pick in ('no', 'n'):
                    print("Canceling query..")
                    return None
                else:
                    print("Select either yes (y) or no (n).")
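The wait-and-retry pattern used above can also be written as a small generic helper that takes the waiting times as a sequence. This is only a sketch under stated assumptions: retry_with_waits is a hypothetical name, the default waits mirror the 10 minute, 2 hour and 24 hour pauses above, and the sleep function is injectable so the logic can be tried without actually waiting.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar('T')


def retry_with_waits(
        func: Callable[[], T],
        waits: Sequence[float] = (600, 7200, 86400),
        sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func; on RuntimeError, sleep for the next wait and try again.

    Makes len(waits) + 1 attempts in total. A RuntimeError raised by the
    final attempt propagates to the caller.
    """
    for wait in waits:
        try:
            return func()
        except RuntimeError:
            sleep(wait)
    return func()  # final attempt, errors are not caught here


# Demonstration: a function that fails twice, then succeeds
attempts: List[int] = []


def flaky() -> List[str]:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("not all shards active")
    return ["ok"]


slept: List[float] = []
result = retry_with_waits(flaky, sleep=slept.append)  # record waits instead of sleeping
print(result, slept)
```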
The main query goes backward in time. We aim for about 100 submissions returned per query. If a query returns fewer, the time period for the next round is increased, so that more submissions are retrieved per query further back in time.
In [ ]:
def query_time(api: PushshiftAPI = API, subreddit: str = SUBREDDIT,
        start_year: int = 2010, end_year: int = 2023,
        start_month: int = 1, end_month: int = 12, reddit: praw.Reddit = REDDIT) -> int:
    """Given a start and end date, loop through years/months 
    in reverse order and store results to disk.
    Returns total queried count (int) per subreddit.
    """
    total_queried = 0
    zero_received = 0
    query_delta = 4 # weeks
    _exit = False
    created_date = dt.datetime.fromtimestamp(
        reddit.subreddit(subreddit).created_utc)
    min_date = dt.datetime(start_year, start_month, 1, 0, 0)
    if min_date < created_date:
        # check if min date is below creation of subreddit
        min_date = created_date
        print(
            f'[{subreddit}] Limited to lowest available date for this subreddit: '
            f'{min_date.year}-{min_date.month}')
    _, max_day = calendar.monthrange(end_year, end_month)
    max_date = dt.datetime(end_year, end_month, max_day, 0, 0)
    query_max_date = max_date
    query_min_date = max_date - dt.timedelta(weeks=query_delta)
    
    while query_min_date >= min_date and not _exit:
        if query_min_date == min_date:
            # last round
            _exit = True
        # limit query by time period, day accuracy, 1 day overlap buffer
        before = int(
            (dt.datetime(
                query_max_date.year, query_max_date.month, query_max_date.day, 0, 0) + \
                dt.timedelta(days=1)).timestamp())
        after = int(
            dt.datetime(
                query_min_date.year, query_min_date.month, query_min_date.day, 0, 0).timestamp())
        print(
            f'Querying between {query_min_date.year}-{query_min_date.month}-{query_min_date.day} (min) '
            f'and {query_max_date.year}-{query_max_date.month}-{query_max_date.day} (max)')
        all_submissions = query_pushshift(
            subreddit=subreddit, before=before, after=after, api=api, num_retries=3)
        if all_submissions is None:
            # user canceled the query: stop without writing a file for this period
            break
        count = len(all_submissions)
        total_queried += count
        # clear jupyter output
        clear_output(wait=True)
        print(
            f'[{subreddit}] Retrieved {count} submissions between '
            f'{query_min_date.year}-{query_min_date.month}-{query_min_date.day} '
            f'and {query_max_date.year}-{query_max_date.month}-{query_max_date.day}. '
            f'Total queried: {total_queried:,}')
        if count == 0:
            zero_received += 1
            if zero_received > 12:
                # if none received for the last 12 queries,
                # exit early
                print(f'Exiting early at {query_min_date.year}-{query_min_date.month}')
                _exit = True
        else:
            zero_received = 0
        if count > 0 and count < 100:
            ratio = int(100/count)
            # increase timespan queried, based on counts retrieved;
            # each round adds at most 12 months (=52 weeks)
            query_delta += min(ratio*4, 52)
            # query_delta can only go up, but not down
        # write data
        write_json(
            items=all_submissions,
            name=f'reddit_{subreddit}_{query_min_date.year}-{query_min_date.month}-{query_min_date.day}.json',
            subreddit=subreddit)
        # update query_min_date and query_max_date
        query_max_date = query_min_date
        query_min_date = query_max_date - dt.timedelta(weeks=query_delta)
        if query_min_date < min_date:
            query_min_date = min_date
    return total_queried
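The window-widening arithmetic inside query_time can be isolated to see how quickly the queried period grows. The following sketch mirrors that arithmetic; the function name next_query_delta and its parameter names are hypothetical and only serve this illustration.

```python
def next_query_delta(
        count: int, query_delta: int, target: int = 100,
        step_weeks: int = 4, max_step_weeks: int = 52) -> int:
    """Return the query window (in weeks) to use for the next round.

    The window grows only if some, but fewer than target, submissions
    were returned. It never shrinks.
    """
    if 0 < count < target:
        ratio = int(target / count)
        query_delta += min(ratio * step_weeks, max_step_weeks)
    return query_delta


print(next_query_delta(count=50, query_delta=4))   # 4 + 2*4 weeks
print(next_query_delta(count=4, query_delta=4))    # 4 + 52 weeks (step is capped)
print(next_query_delta(count=0, query_delta=4))    # unchanged
print(next_query_delta(count=150, query_delta=4))  # unchanged
```

Because the step is added to the current window, sparse subreddits reach windows longer than a year after one or two rounds, which matches the 56-week span between 2021-2-7 and 2022-3-6 in the output below.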
In [ ]:
SUBREDDIT = 'GrandTetonNatlPark'
(OUTPUT / SUBREDDIT).mkdir(exist_ok=True)
query_time(start_year=2010, start_month=1, end_year=2023, end_month=4, subreddit=SUBREDDIT, reddit=REDDIT)
[GrandTetonNatlPark] Retrieved 0 submissions between 2021-2-7 and 2022-3-6. Total queried: 22
Querying between 2020-1-12 (min) and 2021-2-7 (max)

Retrieve comments for submissions

We can use the original Reddit API to get comments for submissions, since most submissions contain fewer than 1000 comments.

Have a look at the praw comments docs.

Test this for the submission with ID a2s0cb, which has 18 comments.

In [26]:
submission = REDDIT.submission("a2s0cb")
submission.comments.replace_more(limit=None)
Out[26]:
[]
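After replace_more(limit=None) has resolved all "load more comments" placeholders, submission.comments.list() returns the whole comment tree as a flat list. A minimal sketch of turning such a list into plain dictionaries follows, in the same spirit as the field selection for submissions. The helper comments_to_dicts and the COMMENT_FIELDS selection are assumptions made for illustration; the demonstration uses a stand-in object with a placeholder body instead of a live praw comment.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Sequence

# Illustrative selection of comment attributes, not the notebook's final choice
COMMENT_FIELDS = ('id', 'parent_id', 'link_id', 'created_utc', 'body')


def comments_to_dicts(
        comments: Iterable[Any],
        fields: Sequence[str] = COMMENT_FIELDS) -> List[Dict[str, Any]]:
    """Extract selected attributes from comment-like objects."""
    rows = []
    for comment in comments:
        row = {field: getattr(comment, field, None) for field in fields}
        author = getattr(comment, 'author', None)
        # deleted accounts have no author; otherwise store the name as text
        row['author'] = str(author) if author is not None else None
        rows.append(row)
    return rows


# Stand-in for a praw comment, so the sketch runs without network access
stub = [SimpleNamespace(
    id='eb0t864', parent_id='t3_a2s0cb', link_id='t3_a2s0cb',
    created_utc=1543868359.0, body='(comment text)', author='stellarnest')]
rows = comments_to_dicts(stub)
print(rows[0]['id'], rows[0]['author'])

# With a live session, the call would be:
# rows = comments_to_dicts(submission.comments.list())
```

A parent_id starting with t3_ points to the submission itself (a top-level comment), while t1_ points to another comment, so the flat list still allows the thread structure to be rebuilt.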
Click to show comments

Comment #00:

Depends on where you're going, but early May in the high country probably isn't a good time to go camping unless you're ok with collllld nights, dangerous/impossible water crossings, lots of snow still and lots of mud depending on how winter goes this year.

I personally like camping in the back country in the summer months when I don't have to deal with as much shit weather, but this is usually how I do it, though, most people I see out there are carrying all kinds of random stuff

Always rain gear...super lightweight waterproof rain jacket, and a cheaper pair of waterproof pants that I slip over whatever I'm wearing and a cover for my pack

Shorts and a T shirt, water wicking base layer for night or if its chilly...extra pairs of clean socks feel amazing after hiking all day.. Maybe attach a pair of flip flops for chillin around camp. Warm fleece, hat and shades, tiny thing of sunscreen if you think you'll need it

20 degree sleeping bag and tent...

I like using this little thing I bought by MSR. It has a tiny stove, pot thing to boil water, a lid to drink out of...As far as food goes, I really really hate eating dry food at the end of the day so I always make myself a nice meal for dinner.. That could be cup of noodles, mac and cheese deluxe, maybe bring some dehydrated black beans and a couple tortillas, just-add-water mashed potatoes etc..While hiking I eat...candy, basically. Usually snickers and cliff bars because they seem like it's not just straight sugar?? But usually just straight sugar...flavored instant oatmeal for breakfast.

Ummm, Nalgene and water filter, baggies and toilet paper (Pack it out pleeeeaase), lighter, DEET, permit and I think you're pretty much set!

Comment #01:

Where specifically are you going? Huge difference in what you will need depending on what elevation you will be at.

Comment #02:

Is this your first backcountry trip? Or is this just a new climate for you?

I'm happy to give a more detailed response once I have a better idea of the level of detail you'll need.

Also, do you have a specific trip in mind? If you haven't already, familiarize yourself with the permitting process for Yosemite.

Comment #03:

Not sure yet. Obviously we’d like to see el cap, so around that area.

Comment #04:

Hopefully I’ll have been able to get out and do it near me at least a little before I go, but for now I don’t have any experience besides basic hiking.

And we’re still planning the trip part. Obviously we’d want to see el cap, and we have some crazy notion of climbing half dome, but that’s probably a future trip. I didn’t even know that we needed permits.

Comment #05:

Are you talking about car camping (in an establisted campsite you drive to) or backcountry wilderness camping (that you backpack/hike to)

Comment #06:

Well, you can see El Cap from the road driving into the Valley- its not a backcountry destination. In order to hike Half Dome you do need permits and the demand for these permits is higher than the supply.

Perhaps there is some confusion surrounding the word backcountry. Backcountry camping involves hiking to a new site everyday in the backcountry. In Yosemite this means you need to be 4 miles from any trailhead.

There is also front country camping in Yosemite. There are a number of campsites that you can drive your car to.

Which is the one you are trying to do?

Either way, i suggest starting to put together a plan now.

The booking window for May car campsites is quickly approaching- and to get a site it really is important to try to make reservations as soon as they open up.

Additionally, it is a good idea to try to secure reservations if you wish to do a backpacking trip on one of the most popular trailheads. You can request reservations 24 weeks in advance of your trip.

Some of the gear you need will depend on where in Yosemite you camp. Keep in mind higher elevations may still have snow or be very wet.

Start here.

Comment #07:

I didn’t even know that we needed permits.

https://www.nps.gov/yose/planyourvisit/wildpermits.htm

Comment #08:

I may have missed this but the obvious must be said... DO NOT PLAN ON HAVING A CAMP FIRE! Thank you.

Another thing to keep in mind is the rangers will know good routes for tail heads available for what you have, time out and such.

Comment #09:

No. Backcountry. Backpack, setting up a camp, etc.

Comment #10:

In order to hike Half Dome you do need permits and the demand for these permits is higher than the supply.

Half Dome cables most likely won't be up by early May in any case

Comment #11:

Backcountry. And I am planning now, that’s what this is for. I’m just not sure where to start. I’ve never done anything like this and neither have most people I know. This is all very helpful, so thank you.

Comment #12:

Good to know.

Comment #13:

Early May is the tail end of winter/start of spring. Much of the park will still be snowbound.

Are you looking to hike/camp on dry ground or snowshoe/ski and camp on snowpack?

BTW, El Cap is visible from Yosemite Valley. So no need to hike to see that.

Comment #14:

Oh my goodness, excellent point. Thank you for this out!

Comment #15:

OK, so in order to get more specific gear recommendations we would really need to both your intended route and what gear you have already.

I recommend posting a gear list and/ or asking for recommendations for specific items on one of the more general backpacking oriented subs too!

In general, a few tips:

Read up on food storage regulations in Yosemite. For backpacking you will be required to carry a bear resistant canister. You can either rent one or buy your own.

plan for it to be cold at night. That means your sleeping system needs to be up to the task. Keep in mind that your sleeping pad is just as important than your sleeping bag when it comes to staying warm at night as is having dry clothes to sleep in.

A few things you might not thing of regarding cold. . . if your water treatment plan is to filter, put your filter in a zip lock at night and put it in your sleeping bag at night so it does not freeze, alkaline batteries drain quickly in cold temps- use lithium batteries in the cold instead.

As far as clothes, layers are the way to go. The appropriate clothing system will largely be determined by the elevation of your trip. Things like hats and gloves can really help keep you warm at a low cost in space/ weight.

Make sure your backpack fits you well and that it is able to handle the load you plan to carry. The load you plan to carry will vary based on what gear you select, the length of your trip, the nature of your trip and whether heavier/ more gear is required.

This is getting really a little too general, but when it comes to choosing gear for backpacking, most items will have varying degrees of the following three attributes: Low weight/ highly packable, Durability/ features, (relative) affordability. For the purpose of making gear decisions, it is helpful to pick two of these three things to prioritize (which 2 of the 3 matter most to you?)

On the issue of weight, I do recommend avoiding buying an ultralight pack unless you are very dedicated to buying all ultralight gear as such packs really do not handle heavy loads well at all.

As for other gear, you really just need to have "the ten essentials" covered. It is also a good idea to have a "backup" for everything. Keep in mind that a backup does not have to be a completely independent single use item. . . because that would be crazy. You just need to be able to still be covered as far as the "ten essentials" if something breaks either by fixing the thing well enough to use it, or using a different thing for the same purpose. A little duct tape can fix a lot of things, for example, and is not very heavy. As much as possible, everything you carry you should plan to use instead of bringing huge amounts of stuff "just in case," while at the same time being prepared to handle any potential issues (don't bring two of everything but also don't forego some critical "extra" items that might be left unused like a first aid kit and some extra batteries).

Do some research on trails you might like to do in Yosemite and then post again with more specific questions. Keep in mind that it is going to be impossible for anyone to predict the conditions exactly, but we will be able to point out any potential/ likely issues.

Comment #16:

Okay, good to know. We would probably be looking for dry ground. Hopefully spots with more trees. We have lightweight tents, but would prefer to hammock if possible.

Comment #17:

Your options for May are really going to depend on where the snow line is.

Hetch Hetchy along the reservoir is the lowest elevation route in the park and most likely won't have snow. (The bridge may be flooded tho)

Your other options are basically going to be the valley exit routes (Yosemite Falls, Snow Creek, or JMT/Mist Trail) to wherever the snow line is. If this winter turns out to be less than average snowpack, Little Yosemite Valley might be snow free. You'll likely have difficulty finding a 3 day route w/o hitting snow- your best bet will likely be to hug the Valley north rim.

Temperatures should be mild by that time of year and you should be able to get away with 3 season gear. But that is still in the window where a late season storm could possibly hit.

For clothing you should layer and prepare for temperatures as high as 70 degrees in the afternoon and possibly dipping down to near freezing overnight. Food/cooking system is up to your personal preferences but note that you are required to carry a bear can in the park (You can rent these when you pick up your wilderness permit)

Have a look at the available attributes for comments:

In [74]:
display(Markdown(
    f'<details><summary>Click</summary>\n\n```\n\n'
    f'{pprint.pformat(vars(submission.comments[0]), indent=4)}\n\n```\n\n</details>'))
Click

{   '_fetched': True,
    '_reddit': <praw.reddit.Reddit object at 0x7f678131b5b0>,
    '_replies': <praw.models.comment_forest.CommentForest object at 0x7f6781275550>,
    '_submission': Submission(id='a2s0cb'),
    'all_awardings': [],
    'approved_at_utc': None,
    'approved_by': None,
    'archived': False,
    'associated_award': None,
    'author': Redditor(name='stellarnest'),
    'author_flair_background_color': None,
    'author_flair_css_class': None,
    'author_flair_richtext': [],
    'author_flair_template_id': None,
    'author_flair_text': None,
    'author_flair_text_color': None,
    'author_flair_type': 'text',
    'author_fullname': 't2_2kddxlaq',
    'author_is_blocked': False,
    'author_patreon_flair': False,
    'author_premium': False,
    'awarders': [],
    'banned_at_utc': None,
    'banned_by': None,
    'body': "Depends on where you're going, but early May in the high country "
            "probably isn't a good time to go camping unless you're ok with "
            'collllld nights, dangerous/impossible water crossings, lots of '
            'snow still and lots of mud depending on how winter goes this '
            'year.\n'
            '\n'
            'I personally like camping in the back country in the summer '
            "months when I don't have to deal with as much shit weather, but "
            'this is usually how *I* do it, though, most people I see out '
            'there are carrying all kinds of random stuff\n'
            '\n'
            'Always rain gear...super lightweight waterproof rain jacket, and '
            "a cheaper pair of waterproof pants that I slip over whatever I'm "
            'wearing and a cover for my pack\n'
            '\n'
            'Shorts and a T shirt, water wicking base layer for night or if '
            'its chilly...extra pairs of clean socks feel amazing after hiking '
            'all day.. Maybe attach a pair of flip flops for chillin around '
            'camp. Warm fleece, hat and shades, tiny thing of sunscreen if you '
            "think you'll need it\n"
            '\n'
            '20 degree sleeping bag and tent... \n'
            '\n'
            'I like using this little thing I bought by MSR. It has a tiny '
            'stove, pot thing to boil water, a lid to drink out of...As far as '
            'food goes, I really really hate eating dry food at the end of the '
            'day so I always make myself a nice meal for dinner.. That could '
            'be cup of noodles, mac and cheese deluxe, maybe bring some '
            'dehydrated black beans and a couple tortillas, just-add-water '
            'mashed potatoes etc..While hiking I eat...candy, basically. '
            "Usually snickers and cliff bars because they seem like it's not "
            'just straight sugar?? But usually just straight sugar...flavored '
            'instant oatmeal for breakfast.\n'
            '\n'
            'Ummm, Nalgene and water filter, baggies and toilet paper (Pack it '
            "out pleeeeaase), lighter, DEET, permit and I think you're pretty "
            'much set!',
    'body_html': '<div class="md"><p>Depends on where you&#39;re going, but '
                 'early May in the high country probably isn&#39;t a good time '
                 'to go camping unless you&#39;re ok with collllld nights, '
                 'dangerous/impossible water crossings, lots of snow still and '
                 'lots of mud depending on how winter goes this year.</p>\n'
                 '\n'
                 '<p>I personally like camping in the back country in the '
                 'summer months when I don&#39;t have to deal with as much '
                 'shit weather, but this is usually how <em>I</em> do it, '
                 'though, most people I see out there are carrying all kinds '
                 'of random stuff</p>\n'
                 '\n'
                 '<p>Always rain gear...super lightweight waterproof rain '
                 'jacket, and a cheaper pair of waterproof pants that I slip '
                 'over whatever I&#39;m wearing and a cover for my pack</p>\n'
                 '\n'
                 '<p>Shorts and a T shirt, water wicking base layer for night '
                 'or if its chilly...extra pairs of clean socks feel amazing '
                 'after hiking all day.. Maybe attach a pair of flip flops for '
                 'chillin around camp. Warm fleece, hat and shades, tiny thing '
                 'of sunscreen if you think you&#39;ll need it</p>\n'
                 '\n'
                 '<p>20 degree sleeping bag and tent... </p>\n'
                 '\n'
                 '<p>I like using this little thing I bought by MSR. It has a '
                 'tiny stove, pot thing to boil water, a lid to drink out '
                 'of...As far as food goes, I really really hate eating dry '
                 'food at the end of the day so I always make myself a nice '
                 'meal for dinner.. That could be cup of noodles, mac and '
                 'cheese deluxe, maybe bring some dehydrated black beans and a '
                 'couple tortillas, just-add-water mashed potatoes etc..While '
                 'hiking I eat...candy, basically. Usually snickers and cliff '
                 'bars because they seem like it&#39;s not just straight '
                 'sugar?? But usually just straight sugar...flavored instant '
                 'oatmeal for breakfast.</p>\n'
                 '\n'
                 '<p>Ummm, Nalgene and water filter, baggies and toilet paper '
                 '(Pack it out pleeeeaase), lighter, DEET, permit and I think '
                 'you&#39;re pretty much set!</p>\n'
                 '</div>',
    'can_gild': True,
    'can_mod_post': False,
    'collapsed': False,
    'collapsed_because_crowd_control': None,
    'collapsed_reason': None,
    'collapsed_reason_code': None,
    'comment_type': None,
    'controversiality': 0,
    'created': 1543868359.0,
    'created_utc': 1543868359.0,
    'depth': 0,
    'distinguished': None,
    'downs': 0,
    'edited': False,
    'gilded': 0,
    'gildings': {},
    'id': 'eb0t864',
    'is_submitter': False,
    'likes': None,
    'link_id': 't3_a2s0cb',
    'locked': False,
    'mod_note': None,
    'mod_reason_by': None,
    'mod_reason_title': None,
    'mod_reports': [],
    'name': 't1_eb0t864',
    'no_follow': False,
    'num_reports': None,
    'parent_id': 't3_a2s0cb',
    'permalink': '/r/Yosemite/comments/a2s0cb/gear/eb0t864/',
    'removal_reason': None,
    'report_reasons': None,
    'saved': False,
    'score': 3,
    'score_hidden': False,
    'send_replies': True,
    'stickied': False,
    'subreddit': Subreddit(display_name='Yosemite'),
    'subreddit_id': 't5_2sbo4',
    'subreddit_name_prefixed': 'r/Yosemite',
    'subreddit_type': 'public',
    'top_awarded_type': None,
    'total_awards_received': 0,
    'treatment_tags': [],
    'unrepliable_reason': None,
    'ups': 3,
    'user_reports': []}

Filter for selected comment fields

In [25]:
COMMENTS_FIELDS = (
    'id', 'created_utc', 'author_flair_text', 'author', 'is_submitter', 
    'name', 'parent_id', 'permalink', 'score', 
    'body', 'total_awards_received', 'ups', 'downs', 'likes')
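
As a minimal sketch of how such a field tuple is applied: each comment's attribute dictionary is reduced to the selected keys. The `raw_comment` dictionary, the shortened `FIELDS` tuple, and the helper name `select_fields` are made-up stand-ins for illustration, not data or code from the API.

```python
from typing import Any, Dict, Tuple

# Shortened, illustrative subset of the selected comment fields
FIELDS: Tuple[str, ...] = ('id', 'created_utc', 'author', 'score', 'body')


def select_fields(comment: Dict[str, Any], fields: Tuple[str, ...]) -> Dict[str, Any]:
    """Keep only the selected keys; 'author' is converted to str,
    because praw returns it as an object rather than a plain string."""
    return {
        field: str(comment[field]) if field == 'author' else comment[field]
        for field in fields}


# Hypothetical comment attributes (values invented for this example)
raw_comment = {
    'id': 'abc123', 'created_utc': 1543868359.0, 'author': 'some_user',
    'score': 3, 'body': 'Example text', 'saved': False, 'locked': False}

filtered = select_fields(raw_comment, FIELDS)
print(filtered)
```

Keys that are not listed, such as `saved` or `locked` here, are dropped, which keeps the stored JSON small.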

Traverse stored json and retrieve all comments

The last step is to loop through the stored JSON files of all submissions and retrieve the comments for each submission.

In [61]:
from prawcore.exceptions import ServerError, Forbidden, RequestException

def write_list(list_of_items: List[Dict[str, str]], output: Path):
    """Write list of json items as dump to disk"""
    if not list_of_items:
        return
    filename = output / 'reddit_comments.json'
    if filename.exists():
        filename = get_new_filename(filename)
    with open(filename, 'w') as f:
        json.dump(list_of_items, f)
    print(f'Wrote {len(list_of_items)} comments to {filename.name}')

def filter_comments_json(
        items: List[Dict[str, str]], submission_id: str, 
        list_of_items: List[Dict[str, str]], comments_fields: Tuple[str, ...] = COMMENTS_FIELDS):
    """Reduce each comment to the selected fields and append it to list_of_items"""
    if not items:
        return
    for comment in items:
        # vars() returns the attribute dictionary of the praw Comment object
        comment = vars(comment)
        # author is a praw object (or None) and is converted to str;
        # the other selected fields are already JSON-serializable
        sub_dict = {
            field: str(comment[field]) if field == 'author' else comment[field]
            for field in comments_fields}
        sub_dict['submission_id'] = submission_id
        list_of_items.append(sub_dict)

def query_comments(submission_id: str, reddit: praw.Reddit, num_retries: int = 3):
    """Query a single submission for all comments"""
    for ix in range(num_retries):
        try:
            submission = reddit.submission(submission_id)
            submission.comments.replace_more(limit=None)
            return submission.comments.list()
        except (ServerError, RequestException) as e:
            # exception instances have no __name__; use the class name instead
            print(f"Received {type(e).__name__}")
            if ix < (num_retries - 1):
                print(
                    f"Waiting for the Reddit API to become responsive again "
                    f"(try {ix+1} of {num_retries})")
                if ix == 0:
                    print("Sleeping for 1 minute")
                    time.sleep(60)
                elif ix == 1:
                    print("Sleeping for 10 minutes")
                    time.sleep(600)
                elif ix == 2:
                    print("Sleeping for 1 hour")
                    time.sleep(3600)
            else:
                raise WaitingException()
        except Forbidden:
            print(
                f"Received a Forbidden Exception. "
                f"(try {ix+1} of {num_retries})")
            if ix == 0 and num_retries > 1:
                print("Trying one more time after 1 minute..")
                time.sleep(60)
            else:
                # Give up: the caller treats a None result as "no comments"
                print("Skipping entry..")
                return None
            
                
def get_all_comments(
        submission_id: str, list_of_items: List[Dict[str, str]], 
        output: Path, total_ct: int, perc: str, reddit: praw.Reddit) -> int:
    """Get all comments for submission"""
    all_comments = query_comments(submission_id=submission_id, reddit=reddit)
    if all_comments is None:
        return 0
    filter_comments_json(
        items=all_comments, submission_id=submission_id, list_of_items=list_of_items)
    if len(list_of_items) > 1000:
        write_list(list_of_items=list_of_items, output=output)
        list_of_items.clear()
    clear_output(wait=True)
    comments_count = len(all_comments)
    print(
        f'Retrieved {comments_count} comments for {submission_id}. '
        f'Total comments queried: {total_ct+comments_count:,} - {perc} files.', end='\r')
    return comments_count
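
The retry logic can also be written as a small generic helper. The following is only a sketch of the pattern, not the code used in this notebook: the name `retry_with_waits` is hypothetical, `ConnectionError` stands in for the prawcore exceptions, and the `sleep` parameter is injectable so that the example runs without actually waiting.

```python
import time
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_waits(
        func: Callable[[], T],
        waits: Sequence[float] = (60, 600, 3600),
        retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
        sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func; on a retryable error, wait and try again.
    Once all waits are used up, the last error is re-raised."""
    for attempt in range(len(waits) + 1):
        try:
            return func()
        except retry_on as e:
            if attempt == len(waits):
                raise
            print(
                f"Received {type(e).__name__}, waiting {waits[attempt]} s "
                f"(try {attempt + 1} of {len(waits) + 1})")
            sleep(waits[attempt])


# Demonstration with a function that fails twice before succeeding
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


slept: List[float] = []  # records the requested waits instead of sleeping
result = retry_with_waits(flaky, waits=(1, 2, 3), sleep=slept.append)
print(result, slept)
```

Keeping the wait times in a sequence avoids one `if` branch per attempt and makes the schedule easy to change.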

See if some submissions have already been processed

In [62]:
SUBREDDIT = "GlacierNationalPark"
In [63]:
processed_file = OUTPUT / SUBREDDIT / "00_processed_submissions.txt"
already_processed = set()
if processed_file.exists():
    # use a context manager so that the file handle is closed after reading
    with open(processed_file, "r") as f:
        already_processed = set(line.strip() for line in f)
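
The checkpoint file makes the query resumable: ids are read once at the start and appended one by one as they are processed. A minimal sketch of this read/append pattern, run in a temporary directory, is given here. The helper names `load_processed` and `mark_processed` and the ids are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Set


def load_processed(path: Path) -> Set[str]:
    """Return the set of ids stored one per line, or an empty set."""
    if not path.exists():
        return set()
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def mark_processed(path: Path, item_id: str) -> None:
    """Append a single id to the checkpoint file."""
    with open(path, "a") as f:
        f.write(f"{item_id}\n")


# Demonstration in a temporary directory; the ids are made up
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "00_processed_submissions.txt"
    first_run = load_processed(checkpoint)    # file does not exist yet
    for sub_id in ("id_a", "id_b"):
        mark_processed(checkpoint, sub_id)
    second_run = load_processed(checkpoint)   # ids from the first run

print(first_run, second_run)
```

Because each id is appended right after it has been processed, an interrupted run loses at most the submission that was in progress.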

Loop through all submissions

In [69]:
def get_comments_subreddit(
    already_processed: Set[str], subreddit: str = SUBREDDIT, output: Path = OUTPUT, reddit: praw.Reddit = REDDIT):
    """Parse a list of submissions (json), stored in a folder for a subreddit,
    and retrieve comments as json from Reddit's original API"""
    list_of_items = []
    total_ct = 0
    output_comments = output / subreddit / "comments"
    output_comments.mkdir(exist_ok=True)
    start_with_ix = 0
    files = list(reversed(sorted((output / subreddit).glob("*.json"))))
    print(f"Processing {len(files)} json files for subreddit {subreddit}")
    skipped = 0
    for ix, json_file in enumerate(files):
        if ix < start_with_ix:
            continue
        with open(json_file, 'r') as f:
            submissions = json.load(f)
        if len(submissions) == 0:
            continue
        perc = f'{ix} of {len(files)}'
        for submission_json in submissions:
            sub_id = submission_json['id']
            if sub_id in already_processed:
                skipped += 1
                continue
            if skipped:
                print(f'Skipped {skipped} submission ids that have already been processed')
                skipped = 0
            total_ct += get_all_comments(sub_id, list_of_items, output_comments, total_ct, perc, reddit)
            with open(output / subreddit / "00_processed_submissions.txt", "a") as cfile:
                cfile.write(f'{sub_id}\n')
            already_processed.add(sub_id)
        print(f'\nFinished {json_file.name}')
    print(f'Writing remaining')
    write_list(list_of_items=list_of_items, output=output_comments)
    print(f'Finished retrieving all comments for {subreddit}')
In [ ]:
get_comments_subreddit(
    already_processed=already_processed, subreddit=SUBREDDIT, output=OUTPUT)

Make this notebook executable via cli

Create a list of all park subreddits to query

In [48]:
PARKS_SUBREDDITS: List[str] = [
    'GrandTetonNatlPark', 'GrandTeton', 'JoshuaTree', 'OlympicNationalPark', 
    'CuyahogaFalls', 'HotSprings', 'BryceCanyon', 'archesnationalpark', 'arches', 
    'NewRiverGorgeNP', 'Mount_Rainier', 'shenandoah', 'ShenandoahPark', 'CapitolReefNP', 
    'DeathValleyNP', 'deathvalley', 'Sequoia', 'Everglades', 'Canyonlands', 'haleakala', 
    'CraterLake', 'BigBendNationalPark', 'BigBend', 'MammothCave', 'Redwoodnationalpark', 
    'KenaiPeninsula', 'lassenvolcanic', 'CarlsbadCavernsNP', 'PinnaclesNP', 'virginislands', 
    'GreatBasinStories', 'glacier', 'isleroyale', 'northcascades', 'AmericanSamoa']

Create a Python script and import the methods from this notebook. All variables and methods in cells not tagged with active-ipynb are loaded and available in the script.

Query for submissions:

In [11]:
display_file(Path.cwd().parents[0] / 'py' / 'get_all_submissions.py')
Have a look at get_all_submissions.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Load all submissions for a list of subreddits.
"""
__author__ = "Alexander Dunkel"
__license__ = "GNU GPLv3"

from dotenv import dotenv_values
from typing import List, Dict
from _pmaw import * # import all methods from notebook

config = dotenv_values(Path.cwd().parent / ".env")
# split the comma-separated value and strip whitespace from each name
PARKS_SUBREDDITS: List[str] = [
    name.strip() for name in config["SUBREDDITS"].split(",")]
# e.g. ["Dresden", "Berlin", "Bonn", "Chemnitz", "Munich", "Hamburg", "Mannheim", "Heidelberg"]


def main():
    """Main cli method to query all submissions and comments for a list of subreddits"""
    total_dict: Dict[str, int] = {}
    processed_file = OUTPUT / "00_processed_subreddits.txt"
    already_processed = set()
    if processed_file.exists():
        already_processed = set(line.strip() for line in open(processed_file, "r"))
    API.shards_down_behavior = "stop" # default: warn
    for subreddit in PARKS_SUBREDDITS:
        if subreddit in already_processed:
            print(f"Skipping already processed Subreddit {subreddit}")
            continue
        (OUTPUT / subreddit).mkdir(exist_ok=True)
        total_queried = query_time(
            start_year=2010, start_month=1,
            end_year=2023, end_month=4, subreddit=subreddit)
        print(f'Finished {subreddit} with {total_queried} submissions queried')
        total_dict[subreddit] = total_queried
        with open(processed_file, "a") as cfile:
            cfile.write(f'{subreddit}\n')
        already_processed.add(subreddit)

if __name__ == "__main__":
    main()
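
The script reads the subreddit names from a comma-separated `SUBREDDITS` entry in `.env`. A small sketch of how such a value can be turned into a clean list is shown here. The function name `parse_subreddits` and the example value are assumptions, and the sketch works on a plain string, so it does not need `python-dotenv`.

```python
from typing import List


def parse_subreddits(value: str) -> List[str]:
    """Split a comma-separated string into subreddit names,
    dropping surrounding whitespace and empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


# Example value as it might appear in a .env file (assumed format):
# SUBREDDITS=Dresden, Berlin,Bonn
print(parse_subreddits("Dresden, Berlin,Bonn"))
```

Stripping each name matters because `str.split(",")` keeps the spaces that follow the commas.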

Query for comments:

In [12]:
display_file(Path.cwd().parents[0] / 'py' / 'get_all_comments.py')
Have a look at get_all_comments.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Load all comments for a list of subreddits and submission ids.
"""
__author__ = "Alexander Dunkel"
__license__ = "GNU GPLv3"

from _pmaw import * # import all methods from notebook

def get_all_comments():
    """Main cli method to query all submission ids for comments, for a list of subreddits"""
    # list comprehension to load all subdirectories for queried subreddits
    # exclude special dirs (e.g. .ipynb_checkpoints)
    queried_subreddits = [
        directory for directory in next(os.walk(OUTPUT))[1] if not directory.startswith('.')]
    processed_file_subreddits = OUTPUT / "00_processed_subreddits_comments.txt"
    already_processed_subreddits = set()
    if processed_file_subreddits.exists():
        already_processed_subreddits = set(line.strip() for line in open(processed_file_subreddits, "r"))
    for ix, subreddit in enumerate(queried_subreddits):
        # check already processed, on a subreddit-id basis
        # this file needs to be cleaned if a subreddit should be queried again (with new submissions)
        if subreddit in already_processed_subreddits:
            print(f"Skipping already processed Subreddit {subreddit}")
            continue
        # check already processed, on a submission-id basis
        processed_file = OUTPUT / subreddit / "00_processed_submissions.txt"
        already_processed = set()
        if processed_file.exists():
            already_processed = set(line.strip() for line in open(processed_file, "r"))
        get_comments_subreddit(
            already_processed=already_processed, subreddit=subreddit, output=OUTPUT)
        print(f'Processed {ix + 1} of {len(queried_subreddits)} subreddits')
        with open(processed_file_subreddits, "a") as cfile:
            cfile.write(f'{subreddit}\n')
        already_processed_subreddits.add(subreddit)

if __name__ == "__main__":
    # if invoked via cli, __name__ == "__main__"
    get_all_comments()

To run:

  • cd into py/ directory
  • activate praw env (e.g. conda activate praw/)
  • python get_all_submissions.py
  • python get_all_comments.py

Create a zip file with all retrieved jsons

In [38]:
from modules.base.tools import zip_dir
In [31]:
%%time
from datetime import date
today = str(date.today())

zip_file = OUTPUT / SUBREDDIT / f'{today}_{SUBREDDIT}_submissions_all.zip'
if not zip_file.exists():
    zip_dir(OUTPUT / SUBREDDIT, zip_file)
    
zip_file = OUTPUT / SUBREDDIT / f'{today}_{SUBREDDIT}_comments_all.zip'
if not zip_file.exists():
    zip_dir(OUTPUT / SUBREDDIT / "comments", zip_file)
CPU times: user 1.28 s, sys: 42.7 ms, total: 1.32 s
Wall time: 1.33 s
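
The implementation of `zip_dir` is not shown in this notebook. As a rough idea of what a comparable helper could look like, here is a sketch based on the standard library `zipfile` module. The function `zip_json_files` and the file names are hypothetical, and this is not the code from `modules.base.tools`.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_json_files(source_dir: Path, zip_file: Path) -> int:
    """Write all *.json files directly inside source_dir to zip_file.
    Returns the number of files added."""
    count = 0
    with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for json_file in sorted(source_dir.glob("*.json")):
            # store files flat, without the parent folder structure
            zf.write(json_file, arcname=json_file.name)
            count += 1
    return count


# Demonstration in a temporary directory with one dummy file
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "comments"
    source.mkdir()
    (source / "reddit_comments.json").write_text("[]")
    target = Path(tmp) / "comments_all.zip"
    added = zip_json_files(source, target)
    with zipfile.ZipFile(target) as zf:
        names = zf.namelist()

print(added, names)
```

Restricting the archive to `*.json` files keeps an already existing zip file in the same folder from being packed into the new one.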

Create notebook HTML

In [8]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./pmaw.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-