An Application Programming Interface (API) is the means by which a piece of software exposes functionality. Ideally an API is well documented so that application programmers can easily interact with it. This ideal is not always achieved in practice.
We are going to be looking at a specific type of API: an interface exposed by a web site or a Web API.
The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web. For example, APIs allow for:
Another term which you are likely to hear in this context is REST or Representational State Transfer. REST is a stateless method for communicating between a client and a server, normally using the HTTP protocol.
A request is sent to the server using GET or POST HTTP methods and the server returns the response via raw HTTP. The interface can be extremely simple or it might accept a range of parameters.
Here are the URLs for two GET requests:
Open the links in a browser and the results will either be rendered in the browser window or saved to your Downloads folder.
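To make this concrete, here is a minimal sketch of how parameters are passed in a GET request: they are simply encoded in the URL. The httpbin.org endpoint used here is just a public testing service, not one of the APIs discussed in this module.
# Parameters follow a "?" in the URL as key=value pairs separated by "&".
url = "https://httpbin.org/get?name=Forrest&film=Gump"
# Reading the URL returns the raw response (JSON, in the case of httpbin.org).
readLines(url, warn = FALSE)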
These are some sites which expose useful APIs:
There are certainly many more. Have a look at ProgrammableWeb to get an idea of just how many.
Although some APIs allow anonymous requests, many require you to register and obtain an API key in order to use their service (or to have less limited access). Generally you can register for an API key at no cost, but in some cases the key is a way to monetise the API.
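How the key is supplied varies from service to service, but most commonly it is passed as a query parameter in the URL or in an HTTP header. A minimal (purely hypothetical) sketch of the query parameter approach; the service and parameter name are made up, so check the documentation of the API you are actually using.
api_key = "YOUR_API_KEY"                                     # obtained when you register
url = paste0("https://api.example.com/v1/data?api_key=", api_key)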
The httr package implements functions for a range of HTTP methods; the most important for us at present are GET() and POST().
library(httr)
mailtest = GET("https://api.mailtest.in/v1/totallyinvaliddomain.com")
The result of a call to GET() is of class response. The first thing that we should do is check that the request was successful, using http_status() to access the HTTP status code.
http_status(mailtest)
$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
A value of 200 indicates success. Next we check the content of the response.
content(mailtest, as = "text", encoding = "UTF-8") # Raw JSON data
[1] "{\"code\":\"22\",\"status\":\"INVALID\",\"message\":\"Unregistered Domain\"}"
The response comes as a JSON document. Evidently that email domain is not valid. No surprises there. Let’s try a valid domain.
mailtest = GET("https://api.mailtest.in/v1/google.com")
Then check that the response is valid and look at the response data.
mailtest$status_code
[1] 200
content(mailtest) # Parsed JSON data -> list
$code
[1] "01"
$status
[1] "ACTIVE"
$message
[1] "OK"
Nice!
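In a script you would typically roll the request, the status check and the parsing into a few lines. A minimal sketch, assuming the mailtest.in service responds as above; stop_for_status() converts a failed request into an R error.
mailtest = GET("https://api.mailtest.in/v1/google.com")
stop_for_status(mailtest)     # throws an error if the status code indicates failure
result = content(mailtest)    # parsed JSON -> list
result$status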
Let’s do something more useful. We’re going to access the OMDb API from within R. We’ve seen how to use GET(). However, since a GET request is equivalent to simply opening the URL in a browser, there are (at least) two other ways to do this. We can treat the URL as a connection.
readLines("http://www.omdbapi.com/?t=Forrest+Gump&y=&plot=short&r=json")
Or use getURL() from the RCurl package.
library(RCurl)
(forrest.gump <- getURL("http://www.omdbapi.com/?t=Forrest+Gump&y=&plot=short&r=json"))
[1] "{\"Title\":\"Forrest Gump\",\"Year\":\"1994\",\"Rated\":\"PG-13\",\"Released\":\"06 Jul 1994\",\"Runtime\":\"142 min\",\"Genre\":\"Drama, Romance\",\"Director\":\"Robert Zemeckis\",\"Writer\":\"Winston Groom (novel), Eric Roth (screenplay)\",\"Actors\":\"Tom Hanks, Rebecca Williams, Sally Field, Michael Conner Humphreys\",\"Plot\":\"Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him.\",\"Language\":\"English\",\"Country\":\"USA\",\"Awards\":\"Won 6 Oscars. Another 37 wins & 51 nominations.\",\"Poster\":\"http://ia.media-imdb.com/images/M/MV5BMTI1Nzk1MzQwMV5BMl5BanBnXkFtZTYwODkxOTA5._V1_SX300.jpg\",\"Metascore\":\"82\",\"imdbRating\":\"8.8\",\"imdbVotes\":\"1,225,276\",\"imdbID\":\"tt0109830\",\"Type\":\"movie\",\"Response\":\"True\"}"
Again the response is a JSON document. It’s a bit of a mess, right? We’ll clean it up and learn more about JSON shortly.
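Incidentally, rather than assembling the query string by hand you can let httr build it for you via the query argument. A minimal sketch (note that OMDb has since begun requiring a registered key, which would be supplied as an additional apikey parameter):
omdb = GET("http://www.omdbapi.com/",
           query = list(t = "Forrest Gump", plot = "short", r = "json"))
content(omdb, as = "text", encoding = "UTF-8")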
Interacting via POST is a little more complicated. We’ll demonstrate the basics with httpresponder. A GET request passes information as part of the URL. By contrast, a POST request sends data in the body of the HTTP request. Check out ?POST.
httpresponder = POST("http://httpresponder.com/ixdatascience.json", body = list(
  course = "Data Science",
  module = "Getting Data from an API"
))
content(httpresponder)
$students
[1] 15
$location
[1] "Cape Town"
The payload can take a variety of forms. Run the POST() command above and then visit httpresponder to view the request content as seen by the server.
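The encode argument to POST() controls how a list body is serialised. A brief sketch using the same test endpoint, assuming it is still available:
# Send the body as a URL-encoded form ...
POST("http://httpresponder.com/ixdatascience.json",
     body = list(course = "Data Science"), encode = "form")
# ... or as a JSON document in the request body.
POST("http://httpresponder.com/ixdatascience.json",
     body = list(course = "Data Science"), encode = "json")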
JSON (JavaScript Object Notation) is a text-based data-exchange format designed to be lightweight, easy for computers to parse and simple for humans to understand. It’s the de facto standard for the exchange of data over the interwebs.
JSON has six basic data types: strings, numbers, Booleans (true or false), null, objects (unordered collections of key/value pairs) and arrays (ordered lists of values).
White space is permitted but ignored. Generally JSON data are minified for transmission. This is the case for the examples that we have seen above.
It’s important to know a little about the JSON format. However, in most situations you’ll not be required to interact directly with JSON data: there are libraries to do the heavy lifting for you. In R the best options are the jsonlite, RJSONIO and rjson packages. The functionality in these packages is very similar. We’ll use jsonlite here but you should feel free to experiment with the others.
library(jsonlite)
First let’s convert the JSON document we retrieved from OMDb into a more human friendly format.
prettify(forrest.gump)
{
"Title": "Forrest Gump",
"Year": "1994",
"Rated": "PG-13",
"Released": "06 Jul 1994",
"Runtime": "142 min",
"Genre": "Drama, Romance",
"Director": "Robert Zemeckis",
"Writer": "Winston Groom (novel), Eric Roth (screenplay)",
"Actors": "Tom Hanks, Rebecca Williams, Sally Field, Michael Conner Humphreys",
"Plot": "Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him.",
"Language": "English",
"Country": "USA",
"Awards": "Won 6 Oscars. Another 37 wins & 51 nominations.",
"Poster": "http://ia.media-imdb.com/images/M/MV5BMTI1Nzk1MzQwMV5BMl5BanBnXkFtZTYwODkxOTA5._V1_SX300.jpg",
"Metascore": "82",
"imdbRating": "8.8",
"imdbVotes": "1,225,276",
"imdbID": "tt0109830",
"Type": "movie",
"Response": "True"
}
That’s a lot easier to digest. The document contains an object (key/value pairs). In this case all keys and values are strings.
The reverse operation can be done with minify(). Normally JSON is stored and transferred in an unformatted, minified state since this reduces the size of the document.
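For example, prettifying and then minifying the OMDb response gets us back to the compact form in which it arrived:
minify(prettify(forrest.gump))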
We can parse the contents of a JSON document with fromJSON(), which will convert it into an R data structure.
forrest.gump.parsed = fromJSON(forrest.gump)
class(forrest.gump.parsed)
[1] "list"
names(forrest.gump.parsed)
[1] "Title" "Year" "Rated" "Released" "Runtime" "Genre" "Director"
[8] "Writer" "Actors" "Plot" "Language" "Country" "Awards" "Poster"
[15] "Metascore" "imdbRating" "imdbVotes" "imdbID" "Type" "Response"
forrest.gump.parsed$Plot
[1] "Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him."
The toJSON() function will convert an R object into a JSON document.
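A quick illustration with a small, made-up data frame; by default jsonlite converts a data frame to an array of objects, one per row.
people = data.frame(name = c("Alice", "Bob"), age = c(31, 25))
toJSON(people)                  # compact, row-wise JSON array
toJSON(people, pretty = TRUE)   # the same, with white space added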
The pair of functions unserializeJSON() and serializeJSON() perform a similar function, but provide greater fidelity for R objects. A serialised R object should capture essentially all aspects of the data, allowing it to be restored almost perfectly.
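The difference is easiest to see with an object that has attributes, like a Date vector. This is just a sketch: toJSON()/fromJSON() keep the values but drop the class, while serializeJSON()/unserializeJSON() preserve it.
dates = as.Date(c("2016-05-30", "2016-05-31"))
fromJSON(toJSON(dates))                  # character vector: the Date class is lost
unserializeJSON(serializeJSON(dates))    # still a Date vector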
A JSON document is an ideal way to store unstructured or semi-structured data. Let’s illustrate this with a simple exercise. Everybody needs to create a personal JSON document. It should have the following minimum information: first name, surname, gender, date of birth. Add to that three other informative fields. For example, it might look like this:
{
"name": "Eric",
"surname": "Blair",
"gender": "Male",
"birth": "25/06/1903",
"death": "21/01/1950",
"books": ["1984", "Animal Farm"],
"pen_name": "George Orwell"
}
We’ll consolidate the entries for the whole class into a single JSON document.
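One way to do that consolidation (a sketch, assuming each person’s entry has been saved as a separate .json file in the working directory):
# Parse each individual document into a list, then write the lot out as one JSON array.
people = lapply(list.files(pattern = "\\.json$"), fromJSON)
writeLines(toJSON(people, auto_unbox = TRUE, pretty = TRUE), "class.json")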
You often don’t need to have low level interactions with an API because there are many R packages which have wrapped these interactions up into functions. We’ll be focussing on two packages in particular (Quandl and twitteR), but here are some others that you should know about:
Rlinkedin
instaR (Instagram)
ROpenWeatherMap and rwunderground
rplos (search journals from the Public Library of Science)
translate (Google Translate)
WikidataR
datarobot (Predictive Modeling API)
telegram (Telegram Bot API)
gmailr
GuardianR (news from The Observer and The Guardian)
rdrop2 (Dropbox)

Quandl Package

The Quandl package is a wrapper around the Quandl API. Quandl is a vast repository of interesting data.
library(Quandl)
We’ll take a look at stock data for Apple. Visit the dashboard for these data on Quandl. Note that you can directly download the data in a range of formats.
AAPL = Quandl("WIKI/AAPL") # You can also supply an API key, but it's not mandatory
head(AAPL[, 1:5])
Date Open High Low Close
1 2016-05-27 99.44 100.47 99.245 100.35
2 2016-05-26 99.68 100.73 98.640 100.41
3 2016-05-25 98.67 99.74 98.110 99.62
4 2016-05-24 97.22 98.09 96.840 97.90
5 2016-05-23 95.87 97.19 95.670 96.43
6 2016-05-20 94.64 95.43 94.520 95.22
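Quandl() accepts further arguments too. For example (a sketch, assuming the WIKI/AAPL dataset is still being updated), you can restrict the date range:
aapl.2016 = Quandl("WIKI/AAPL", start_date = "2016-01-01", end_date = "2016-05-31")
range(aapl.2016$Date)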
twitteR Package

The twitteR package implements a high level interface to the Twitter API. It’s also worth mentioning streamR, which wraps the Twitter Streaming API.
library(twitteR)
The first step towards interacting with this API is to create an authorisation key. You do this by creating an application here.
Creating a new Twitter application.
Fill in a Name and Description. You can put in a placeholder like http://www.example.com for the Website (note that this needs to begin with http://). Leave the Callback URL empty. Submit the form.
On the following page go to the Keys and Access Tokens tab and make a note of the API Key and API Secret. Scroll down and create an Access Token. Make a note of the Access Token and Access Token Secret.
Twitter application keys and secrets.
Now that we’ve jumped through those hoops we can connect to the API.
consumer_key = "aYgTU4eYH3v4yzLwbpwFfkGAj"
consumer_secret = "JM126ueWzWNEDvamaGCS09WzijGI4ANcahxeyPujmxayS9gf0z"
access_token = "3320318445-12v4fKY0hUfbK84EpYnkBbojt3DyY6TEdqj6Tma"
access_secret = "MnMNiuMY7UGcH3f25gYzyV8rpqmlo98PbkaggGjMTI15j"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
You’ll see this message in the Console.
[1] "Using browser based authentication"
Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No
Selection:
Select 1.
You should now be ready to interact with the Twitter API. Let’s give it a test run: find 5 recent tweets mentioning “#datascience” originating from within 500 km of Durban.
(tweets <- searchTwitter('#datascience', geocode='-29.8579,31.0292,500km', n = 5))
[[1]]
[1] "DataWookie: @iXperienceCT #datascience Session 1 kicking off tomorrow! Going to be epic. #rstats #machinelearning @tableau"
[[2]]
[1] "7wData: Doing #datascience: A Kaggle Walkthrough Part 2 – Understanding the #Data\nhttps://t.co/BlqRaEdoFv https://t.co/NqrcebWErH"
[[3]]
[1] "7wData: #datascience set to transform wireless supply chain\nhttps://t.co/mR0KdhmkYZ https://t.co/k0r14TyyoW"
[[4]]
[1] "sentiwire: RT @rlnel: Cool interactive exploration tool for clustering algorithms and outliers | #DataScience #statistics | https://t.co/hDhMCJ1uOb"
[[5]]
[1] "rlnel: Conceptnet Numberbatch: The best word embeddings you can download #DataScience #statistics https://t.co/Pz6NjPGyOH"
Each of the returned items has class status. Check out ?status to find out about the functionality of this class.
tweets[[1]]$getText()
[1] "@iXperienceCT #datascience Session 1 kicking off tomorrow! Going to be epic. #rstats #machinelearning @tableau"
tweets[[1]]$getScreenName()
[1] "DataWookie"
You can turn each of the status objects into a data frame using as.data.frame(). These can then be concatenated to form a nice, tidy data set. You remember how to do this using do.call(), right?
names(as.data.frame(tweets[[1]]))
[1] "text" "favorited" "favoriteCount" "replyToSN" "created" "truncated" "replyToSID"
[8] "id" "replyToUID" "statusSource" "screenName" "retweetCount" "isRetweet" "retweeted"
[15] "longitude" "latitude"
You can extract user information. Look at ?user to find out about the user class.
hadley = getUser("hadleywickham")
hadley$name
[1] "Hadley Wickham"
hadley$description
[1] "R, data, visualisation."
hadley$location
[1] "Houston, TX"
And, of course, you can also tweet. The result has class status. Take a look at ?status to see associated functionality.
msg = tweet("#ixdatascience Let's get this party started! @iXperienceCT #rstats @tableau")
A status update straight from RStudio.
msg$id
[1] "737149648503201793"
msg$created
[1] "2016-05-30 05:12:26 UTC"
with(msg, list(retweeted, retweetCount))
[[1]]
[1] FALSE
[[2]]
[1] 0
That should be enough to get you started. We’ve really just touched the surface though. The twitteR package has excellent coverage of the Twitter API. You’re only limited by your imagination. Have fun! To get inspired, have a look at what other people have been doing with Twitter and R.
Identify the various data types in the following JSON document:
{
"firstName": "Jan",
"lastName": "van der Merwe",
"gender": male,
"height": 1.84,
"isAlive": true,
"age": 63,
"address": {
"streetAddress": "13 Burger Street",
"city": "Kakamas",
"state": "Northern Cape",
"country": "South Africa"
"postalCode": "8870"
},
"phoneNumbers": [
{
"type": "home",
"number": "054 365 2727"
},
{
"type": "mobile",
"number": "073 239 5730"
}
],
"children": [],
"spouse": null
}
Use fromJSON() to parse the contents of this JSON document.
Use the times field in the response to gather statistics on the distribution of DNS lookup and connection times.
Using rdrop2, write a script to do the following: … .RData file; …
library(ggmap)
data <- c("10001", "10002", "10003", "23112")                     # a few US ZIP codes
zip_list <- geocode(data, output = 'latlona', messaging = TRUE)   # longitude, latitude and address for each
zip_list$address