Getting Reddit Data with Python

In the previous post How to Get Submission and Comments with Python Reddit API Wrapper – PRAW I showed how to use the Python Reddit API Wrapper to get information from Reddit. In this post we review a few more ways to get data from Reddit.

Searching the web, I found the following Python script on GitHub. It uses the BeautifulSoup library for parsing HTML and the urllib.request library for opening the Reddit URL.

As per the documentation, the urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.
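
For instance, here is a minimal sketch of opening a subreddit page with urllib.request. The subreddit URL and the User-Agent string are just examples (Reddit tends to reject requests that do not send a browser-like User-Agent):

import urllib.request

# Open a subreddit page and read the raw HTML.
# Reddit may reject requests without a browser-like User-Agent,
# so we set one explicitly (the value here is only an example).
req = urllib.request.Request(
    'https://old.reddit.com/r/mlquestions/',
    headers={'User-Agent': 'Mozilla/5.0 (example scraper)'}
)
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8')

print(html[:500])  # first 500 characters of the page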

Another available example is at Scraping Reddit with Python and BeautifulSoup 4. It also uses BeautifulSoup for HTML parsing, but opens the URL with the requests library. The requests module uses urllib3 under the hood and provides a somewhat higher-level and simpler API on top of it. Multiple discussions on the web recommend using the requests library.
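
As a minimal sketch of that approach (this is not the linked post's exact code; the subreddit URL, User-Agent and CSS selector are assumptions based on old.reddit.com markup and may need adjusting):

import requests
from bs4 import BeautifulSoup

# Fetch a subreddit page; a custom User-Agent helps avoid being blocked
headers = {'User-Agent': 'example reddit scraper 0.1'}
r = requests.get('https://old.reddit.com/r/mlquestions/', headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

# On old.reddit.com post titles are links with class "title" inside p.title
for a in soup.select('p.title > a.title'):
    print(a.get_text())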

There are many ways to get information from the web. Instead of scraping web pages, we can use the website's RSS feed (assuming one is available). For this we need feedparser, a Python library for parsing Atom and RSS feeds.
To install feedparser run the command: pip install feedparser

Here is an example of how to use feedparser to get data from a Reddit RSS link:

import feedparser

d = feedparser.parse('https://www.reddit.com/r/mlquestions.rss')

# print all post titles, inserting a separator line before every block of 5 posts
count = 1
blockcount = 1
for post in d.entries:
    if count % 5 == 1:
        print ("-----------------------------------------\n")
        blockcount += 1
    print (post.title + "\n")
    count += 1
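
Besides the title, each feedparser entry usually exposes other normalized fields, such as the post link. Which fields are present depends on the feed, so it is worth inspecting the keys (the field names below are common ones, not guaranteed for every feed):

# Inspect additional fields of the first post (availability depends on the feed)
first = d.entries[0]
print(first.get('link'))                             # URL of the post
print(first.get('published', first.get('updated')))  # timestamp, if the feed provides one
print(list(first.keys()))                            # everything this feed actually exposes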

Thus we have reviewed several more ways to get information from Reddit. We can use these methods for other websites too; we just need to replace the URL. Below you can find a few more links on how to scrape data from web pages.
Feel free to leave comments or feedback.

References
1. RedditNewsAggregator
2. Scraping Reddit with Python and BeautifulSoup 4
3. A simple Python feedparser script
4. Requests Documentation
5. What is the practical difference between these two ways of making web connections in Python?
6. Beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup
7. Scraping Reddit data

Python API Example with Wallabag Web Application for Extracting Entries and Quotes

In the previous post Python API Example with Wallabag Web Application we explored how to connect to Wallabag via its Web API and create an entry in the Wallabag web application. For this we set up the API, obtained a token via a Python script and then created an entry (added a link).

In this post we will extract entries through the Web API with a Python script. From each entry we will extract needed information such as the entry id. Then, for this id, we will look at how to extract annotations and quotes.

Wallabag is a read-it-later web application like Pocket or Instapaper. Quotes are pieces of text that we highlight within Wallabag. Annotations are notes that we can save together with those quotes. One entry can have several quotes / annotations. Wallabag is open source software, so you can download it and install it locally or remotely on a web server.

If you have not set up the API yet, you need to do that first in order to run the code below. See the previous post for how to do this.
The beginning of the script is the same as before, since we first need to provide our credentials and obtain a token.

Obtaining Entries

After obtaining the token we move on to actually downloading the data. We can obtain entries using the code below:

p = {'archive': 0 , 'starred': 0, 'access_token': access}
r = requests.get('{}/api/entries.txt'.format(HOST), p)

p holds the parameters that allow us to limit the output.
The returned data is a JSON structure with a lot of information, including the entries. It does not include all entries at once: entries are split into pages of 30, and each response provides a link to the next page, so we can follow that link and extract the next batch of entries (see the pagination sketch below).

Each entry has a link, an id and some other information.
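
Here is a hedged sketch of collecting entries from all pages. It assumes HOST and access (the access token) are already set, as in the full script below, and that the entries endpoint accepts a page parameter; the pages field used as the stop condition is present in the response, as shown later in this post.

import requests
import json

# Collect entries from all pages
all_items = []
page = 1
while True:
    p = {'archive': 0, 'starred': 0, 'access_token': access, 'page': page}
    r = requests.get('{}/api/entries.txt'.format(HOST), params=p)
    data = json.loads(r.text)
    all_items.extend(data['_embedded']['items'])
    if page >= data['pages']:   # 'pages' is the total number of pages reported by the API
        break
    page += 1

print(len(all_items), 'entries collected')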

Obtaining Annotations / Quotes

To extract annotations and quotes we can use this code:

p = {'access_token': access}
link = '{}/api/annotations/' + str(data['_embedded']['items'][3]['id']) + '.txt'
print (link)
r = requests.get(link.format(HOST), p)
data=json.loads(r.text)

Full Python Source Code

Below is the full script example:

# Extract entries using wallabag API and Python
# Extract quotes and annotations for specific entry
# Save information to files
import requests
import json

# only these 5 variables have to be set
#HOST = 'https://wallabag.example.org'
USERNAME = 'xxxxxx'
PASSWORD = 'xxxxxx'
CLIENTID = 'xxxxxxxxxxxx'
SECRET = 'xxxxxxxxxxx'
HOST = 'https://intelligentonlinetools.com/wallabag/web'    


gettoken = {'username': USERNAME, 'password': PASSWORD, 'client_id': CLIENTID, 'client_secret': SECRET, 'grant_type': 'password'}
print (gettoken)

r = requests.post('{}/oauth/v2/token'.format(HOST), gettoken)
print (r.content)


access = r.json().get('access_token')

p = {'archive': 0 , 'starred': 0, 'access_token': access}
r = requests.get('{}/api/entries.txt'.format(HOST), p)

data=json.loads(r.text)
print (type(data))


with open('data1.json', 'w') as f:  # writing JSON object
      json.dump(data, f)


for key, value in data.items():
     print (key, value)
     
#Below is how to access needed information at the page level, like the next page link,
#and at the entry level, like the id and url of the entry at index 3 (counting from 0)
print (data['_links']['next']) 
print (data['pages'])
print (data['page']) 
print (data['_embedded']['items'][3]['id'])  
print (data['_embedded']['items'][3]['url'])  
print (data['_embedded']['items'][3]['annotations'])


p = {'access_token': access}

link = '{}/api/annotations/' + str(data['_embedded']['items'][3]['id']) + '.txt'
print (link)
r = requests.get(link.format(HOST), p)
data=json.loads(r.text)
with open('data2.json', 'w') as f:  # writing JSON object
      json.dump(data, f)

#Below is how to access the first and second annotation / quote,
#assuming they exist
print (data['rows'][0]['quote']) 
print (data['rows'][0]['text']) 
print (data['rows'][1]['quote'])    
print (data['rows'][1]['text'])

Conclusion

In this post we learned how to use the Wallabag API to download entries, annotations and quotes. To do this we first downloaded the entries and their ids, then downloaded the annotations and quotes for a specific entry id. Additionally we saw some Python and json examples for getting the needed information from the retrieved data.

Feel free to provide feedback or ask related questions.

Web API to Save to Pocket App and Instapaper App

As we surf the web we find a lot of information that we might use later. We use different applications (Pocket, Instapaper, Diigo, Evernote or other apps) to save the links or notes that we find.

While many of the above applications have a lot of great features, there are still many opportunities to automate some processes using the web APIs that many of these applications now provide.

This allows us to extend application functionality and eliminate some manual processes.

For example: you have about 20 links that you want to send to a Pocket-like application.
Another example: when you add a link to one application, you may also want to save the link or a note to the Pocket app or to Instapaper.
Or maybe you want to automatically (through a script) extract links from some websites and save them to your Pocket app.

In today's post we will look at a few examples that allow you to start doing this. We will check how to use the Pocket API and the Instapaper API with Python.

API for Pocket App

Pocket, previously known as Read It Later, is an application and service for managing a reading list of articles from the Internet. It is available on many different devices including web browsers. (Wikipedia)
There is a great post [1] that shows how to set up the API for it. That post has detailed screenshots of how to get all the identification information needed for a successful login.

In summary, you need to get a consumer key online for your API application, then obtain a request token code via a script. Then you can open the link that includes the token and authorize the application. After this you can use the API for sending links.

Below is a summary Python script that sends a link to the Pocket app, including the previous steps:

import requests

# the redirect link can be any URL
redirect_link = "google.com"
consumer_key="xxxxxxxx"
# obtain the consumer key online
#connect to pocket API to get token code
pocket_api = requests.post('https://getpocket.com/v3/oauth/request',
         data = {'consumer_key':consumer_key,
                 'redirect_uri':redirect_link})

print (pocket_api.status_code)       # if 200, it means all is ok

print(pocket_api.headers)               
print (pocket_api.text)

# remove 'code='
token = pocket_api.text[5:]
print (token)
url="https://getpocket.com/auth/authorize?request_token=" + token + "&redirect_uri=" + redirect_link 

import webbrowser
webbrowser.open_new(url) # opens in default browser
#click on Authorize button in webbrowser

# Once authorization is done you can post as below (uncomment the code below)
"""
pocket_add = requests.post('https://getpocket.com/v3/add',
       data= {'url': 'https://getpocket.com/developer/apps/new',
              'consumer_key':consumer_key,
              'access_token': token})
print (pocket_add.status_code)

"""

API for Instapaper

Instapaper is a bookmarking service owned by Pinterest. It allows web content to be saved so it can be “read later” on a different device, such as an e-reader, smartphone or tablet. (Wikipedia)
Below is a code example of how to send a link to Instapaper. The code example is based on a script that I found on the Internet [2].

import sys
import urllib.error
import urllib.parse
import urllib.request


def error(msg):
    sys.exit(msg)


def main():
    api = 'https://www.instapaper.com/api/add'

    params = urllib.parse.urlencode({
        'username': "actual_user_name",
        'password': "actual_password",
        'url': "https://www.actual_url",
        'title': "actual_title",
        'selection': "description"
    }).encode("utf-8")

    # urlopen raises HTTPError for non-2xx statuses, so catch it
    # to be able to report 400 / 403 / 500 below
    try:
        r = urllib.request.urlopen(api, params)
        status = r.getcode()
    except urllib.error.HTTPError as e:
        status = e.code

    if status == 201:
        print('%s saved as %s' % (r.headers['Content-Location'], r.headers['X-Instapaper-Title']))
    elif status == 400:
        error('Status 400: Bad request or exceeded the rate limit. Probably missing a required parameter, such as url.')
    elif status == 403:
        error('Status 403: Invalid username or password.')
    elif status == 500:
        error('Status 500: The service encountered an error. Please try again later.')


if __name__ == '__main__':
    main()

References
1. Add Pocket API using Python – Tutorial
2. Instapaper