Using Google BigQuery to Obtain Reddit Comment Phrase Counts

programming
sql
Author

TheCoatlessProfessor

Published

November 13, 2018

Introduction

As planning for the next semester kicks into full gear, students are asking an ever-increasing number of questions on the University of Illinois at Urbana-Champaign’s subreddit (/r/uiuc). Some of the questions asked should be squarely directed to an advisor, who is paid to provide guidance on a student’s progress and schedule. Alas, many students opt not to talk with advisors for whatever the reason. Thus, the community has adopted a rallying call by responding to such questions with “ask your advisor” when they arise.

On November 13th, 2018, there was even a Reddit thread that wondered about the number of instances for the phrase “ask your advisor”. The user was a bit curious as a promotional flyer for Engineering Career Services (ECS) Advising said that, “95% of students would recommend ECS Advising to other students.” As a result, my interest was peaked and this post was born.

The post is then split into two sections:

  1. analysis that answers the question
  2. details behind how created the query that obtained the data from Google’s BigQuery service

How many “Ask Your Advisors” were there?

For those only interested in the answer, well… There were only 566 uses of the phrase “Ask your advisor” over the years. The majority came from the /u/IDKAskYourAdvisor account created in December 8, 2016. The top posters of the phrase with at least four posts are:

Usernames Freq
/u/IDKAskYourAdvisor 30
/u/cleverdragon1 11
/u/Mobius118f 8
/u/pissblasta3 8
/u/theillini19 8
/u/[deleted] 7
/u/IlliniTy 6
/u/Mosquite_Leaf 6
/u/CertainTackle 5
/u/DragonZaid 5
/u/uiucrower 5
/u/csdude007 4
/u/GenjoKodo 4
/u/jeffgerickson 4
/u/JRDSandstorm 4
/u/mathuiuc 4
/u/Moi_Username 4
/u/MrAcurite 4
/u/ProgramTheWorld 4
/u/schreiberbj 4
/u/TheFearlessChuaEater 4
/u/UIUCEngineering 4
/u/WUTDO11231235 4

Data Background

Reddit data was compiled and published by Redditor /u/Stuck_In_the_Matrix in the post I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB compressed. Any interest in this? on /r/datasets. These data sets were then uploaded onto Google’s BigQuery by Felipe Hoffa (/u/fhoffa), who is a Google Cloud Developer Advocate. He’s written up a nice overview of the data on Medium.

With this being said, we’re interested primarily in looking at the comments portion of the Reddit data. This data is contained under the Shared Data projects of category of fh-bigquery within the reddit_comments. Depending on the year, the data is organized differently since more users came onto the platform leading to more data being generated.

  • From 2005 - 2014, the comment data is stored in a table coded by year.
    • For example, the table 2007 gives comment data collected for the entire 2007 year.
  • From 2015 - Present, the data is broken up by year and the month it was harvested in.
    • For example, the table 2015_01 gives comment data collected in January of 2015.

For more details about table structure and naming, please see the appendix section.

If this kind of data is interesting to you, consider joining the /r/bigquery community.

The Query

The query we’re interested in crafting obtains all comments posted to the /r/UIUC community that contain the “ask your advisor” phrase (regardless of case). Once we harvest all the comments, we can run other analytics on the data including counting observations.

In short, the query we would like to run can be given as.

SELECT author, body, score, downs, created_utc
FROM `fh-bigquery.reddit_comments.20*`
WHERE subreddit = 'UIUC'
AND REGEXP_CONTAINS(body, r'(?i)ask your advisor')

Understanding the Query

Our focus is largely on obtaining comments with a phrase. To identify and capture comments with a specific phrase requires the use of regular expressions to parse through the unstructured comment text. Under BigQuery’s Standard SQL (and not Legacy SQL), we can filter comments using REGEX_CONTAINS(), which uses re2 library. The patten we want to detect is then:

r'(?i)ask your advisor'

where the (?i) provides case-insensitive matches and “ask your advisor” is the phrase. Note, the case-insensitive match allows for variants like “Ask Your Advisor”, “ask your advisor”, or “AsK YoUr AdViSoR” to be treated as the same phrase.

From here, we want to look at all comments from over the years. This means we’ll be processing slightly over ~974.36 GB of data when run and, thus, we will only be able to run the query once-per month as Google BigQuery has a 1TB analysis quota. Now, you can run this more often, but it quickly becomes pricy as each query is about 1TB that cost ~$5. Referencing all comments over the years can be done using an asterisk inside the FROM declaration, e.g.

`fh-bigquery.reddit_comments.20*`

Now, I should be mention that we can optimize this query a bit by restricting the tables iterated over to just the time after Thu Aug 7 08:16:29 2008, which is when /r/UIUC was created. Though, the total savings would be less than 1 GB. So, for the sake of simplicity, we’re just going to use the asterisk.

Appendix

Extra analysis can be found at: https://github.com/coatless/bigquery-reddit

At a later point, I’ll look into writing a more in-depth blog post.

All of the Reddit comment data tables can be viewed at:

https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments

Generally, all the tables have the same schema. Later tables gain one additional field. The schema for 2015_01 is given here.

Field Name Data Type Mode
body STRING NULLABLE
score_hidden BOOLEAN NULLABLE
archived BOOLEAN NULLABLE
name STRING NULLABLE
author STRING NULLABLE
author_flair_text STRING NULLABLE
downs INTEGER NULLABLE
created_utc INTEGER NULLABLE
subreddit_id STRING NULLABLE
link_id STRING NULLABLE
parent_id STRING NULLABLE
score INTEGER NULLABLE
retrieved_on INTEGER NULLABLE
controversiality INTEGER NULLABLE
gilded INTEGER NULLABLE
id STRING NULLABLE
subreddit STRING NULLABLE
ups INTEGER NULLABLE
distinguished STRING NULLABLE
author_flair_css_class STRING NULLABLE