According to Alexa [1], people spend more time on Reddit than on Facebook, Instagram, or YouTube. Users post questions, share content and ideas, and discuss topics there, so it is very useful to be able to extract text data from this web service automatically. We will look at how to do this with PRAW, the Python Reddit API Wrapper [2].
An example of how to get an API key and use the Python PRAW API can be found at How to scrape reddit with python [3]. However, it does not collect the comments that may be attached to a submission. Comments can contain important information, so I decided to build a Python script with the PRAW API, modified from the above link, that adds comments and a few other minor things.
To get comments, we first need to obtain a submission object.
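A submission object can be obtained directly from its ID or URL. Here is a minimal sketch; the ID and URL below are placeholders, not real posts:

# Two common ways to obtain a submission object in PRAW.
# The ID "abc123" and the URL are placeholders for illustration only.
submission = reddit.submission(id="abc123")
# or, equivalently, from a full URL:
submission = reddit.submission(
    url="https://www.reddit.com/r/learnmachinelearning/comments/abc123/example_post/")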
With a submission object we can then iterate over all of its comments like below:
comment_body = "" for comment in submission.comments.list(): print(comment.body) comment_body = comment_body + comment.body + "\n"
If we wanted to output only the body of the top-level comments in the thread, we could do:
for top_level_comment in submission.comments:
    print(top_level_comment.body)
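If we also wanted to walk the nested replies under each top-level comment, one way is a breadth-first traversal with a simple queue; this is a sketch mirroring the approach in the PRAW comments tutorial:

# A minimal sketch of a breadth-first walk over the whole comment tree.
submission.comments.replace_more(limit=None)
comment_queue = submission.comments[:]  # seed with the top-level comments
while comment_queue:
    comment = comment_queue.pop(0)
    print(comment.body)
    comment_queue.extend(comment.replies)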
Here is the full Python script of the API example that can get Reddit information, including comments. Note that since we are only downloading data and not changing anything, we do not need a user name and password. But in case you are modifying data on Reddit, you would need to include your login information too.
""" To install module: pip install praw """ import praw import pandas as pd from datetime import datetime reddit = praw.Reddit(client_id='xxxxxxxx', \ client_secret='xxxxxxxx', \ user_agent='personal use script') ##username='YOUR_REDDIT_USER_NAME', \ ###password='YOUR_REDDIT_LOGIN_PASSWORD') def get_yyyy_mm_dd_from_utc(dt): date = datetime.utcfromtimestamp(dt) return str(date.year) + "-" + str(date.month) + "-" + str(date.day) subreddit = reddit.subreddit('learnmachinelearning') top_subreddit = subreddit.top(limit=998) topics_dict = { "title":[], "score":[], "id":[], "url":[], \ "comms_num": [], "created": [], "body":[], "z_comments":[]} for submission in top_subreddit: # https://www.reddit.com/r/redditdev/comments/46g9ao/using_praw_to_call_reddit_api_need_help/ topics_dict["title"].append(submission.title) topics_dict["score"].append(submission.score) topics_dict["id"].append(submission.id) topics_dict["url"].append(submission.url) topics_dict["comms_num"].append(submission.num_comments) topics_dict["created"].append(get_yyyy_mm_dd_from_utc(submission.created)) topics_dict["body"].append(submission.selftext) all_comments = submission.comments.list() print (all_comments) # https://praw.readthedocs.io/en/latest/tutorials/comments.html submission.comments.replace_more(limit=None) comment_body = "" for comment in submission.comments.list(): print(comment.body) comment_body = comment_body + comment.body + "\n" topics_dict["z_comments"].append (comment_body) topics_data = pd.DataFrame(topics_dict) topics_data.to_csv('Reddit_data.csv', index=False)
References
1. Alexa: The top 500 sites on the web
2. PRAW: The Python Reddit API Wrapper
3. How to scrape reddit with python
4. PRAW Tutorials
5. Webscraping Reddit — Python Reddit API Wrapper (PRAW) Tutorial for Windows