An Application Programming Interface (API) is the means by which a piece of software exposes functionality. Ideally an API is well documented so that application programmers can easily interact with it. This ideal is not always achieved in practice.
We are going to be looking at a specific type of API: an interface exposed by a web site or a Web API.
The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web. For example, APIs allow for:
Another term which you are likely to hear in this context is REST or Representational State Transfer. REST is a stateless method for communicating between a client and a server, normally using the HTTP protocol.
A request is sent to the server using GET or POST HTTP methods and the server returns the response via raw HTTP. The interface can be extremely simple or it might accept a range of parameters.
Here are the URLs for two GET requests:
Open the links in a browser and the results will either be rendered in the browser window or saved to your Downloads folder.
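To make this concrete, here is a minimal sketch of how parameters are passed in a GET request: they are simply encoded in the URL. The httpbin.org endpoint used here is just a public testing service, not one of the APIs discussed in this module.
# Parameters follow a "?" in the URL as key=value pairs separated by "&".
url = "https://httpbin.org/get?name=Forrest&film=Gump"
# Reading the URL returns the raw response (JSON, in the case of httpbin.org).
readLines(url, warn = FALSE)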
These are some sites which expose useful APIs:
There are certainly many more. Have a look at ProgrammableWeb to get an idea of just how many.
Although some APIs allow anonymous requests, many require you to register and obtain an API key in order to use their service (or to have less limited access). Generally you can register for an API key at no cost, but in some cases the key is a way to monetise the API.
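How the key is supplied varies from service to service, but most commonly it is passed as a query parameter in the URL or in an HTTP header. A minimal (purely hypothetical) sketch of the query parameter approach; the service and parameter name are made up, so check the documentation of the API you are actually using.
api_key = "YOUR_API_KEY"                                     # obtained when you register
url = paste0("https://api.example.com/v1/data?api_key=", api_key)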
The httr package implements functions for a range of HTTP methods; the most important for us at present are GET() and POST().
library(httr)
mailtest = GET("https://api.mailtest.in/v1/totallyinvaliddomain.com")
The result of a call to GET() is of class response. The first thing that we should do is check that the request was successful, using http_status() to access the HTTP status code.
http_status(mailtest)
$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
A value of 200 indicates success. Next we check the content of the response.
content(mailtest, as = "text", encoding = "UTF-8") # Raw JSON data
[1] "{\"code\":\"22\",\"status\":\"INVALID\",\"message\":\"Unregistered Domain\"}"
The response comes as a JSON document. Evidently that email domain is not valid. No surprises there. Let’s try a valid domain.
mailtest = GET("https://api.mailtest.in/v1/google.com")
Then check that the response is valid and look at the response data.
mailtest$status_code
[1] 200
content(mailtest) # Parsed JSON data -> list
$code
[1] "01"
$status
[1] "ACTIVE"
$message
[1] "OK"
Nice!
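In a script you would typically roll the request, the status check and the parsing into a few lines. A minimal sketch, assuming the mailtest.in service responds as above; stop_for_status() converts a failed request into an R error.
mailtest = GET("https://api.mailtest.in/v1/google.com")
stop_for_status(mailtest)     # throws an error if the status code indicates failure
result = content(mailtest)    # parsed JSON -> list
result$status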
Let’s do something more useful. We’re going to access the OMDb API from within R. We’ve seen how to use GET(). However, since a GET request is equivalent to simply opening the URL in a browser, there are (at least) two other ways to do this. We can treat the URL as a connection.
readLines("http://www.omdbapi.com/?t=Forrest+Gump&y=&plot=short&r=json")
Or use getURL() from the RCurl package.
library(RCurl)
(forrest.gump <- getURL("http://www.omdbapi.com/?t=Forrest+Gump&y=&plot=short&r=json"))
[1] "{\"Title\":\"Forrest Gump\",\"Year\":\"1994\",\"Rated\":\"PG-13\",\"Released\":\"06 Jul 1994\",\"Runtime\":\"142 min\",\"Genre\":\"Drama, Romance\",\"Director\":\"Robert Zemeckis\",\"Writer\":\"Winston Groom (novel), Eric Roth (screenplay)\",\"Actors\":\"Tom Hanks, Rebecca Williams, Sally Field, Michael Conner Humphreys\",\"Plot\":\"Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him.\",\"Language\":\"English\",\"Country\":\"USA\",\"Awards\":\"Won 6 Oscars. Another 37 wins & 51 nominations.\",\"Poster\":\"http://ia.media-imdb.com/images/M/MV5BMTI1Nzk1MzQwMV5BMl5BanBnXkFtZTYwODkxOTA5._V1_SX300.jpg\",\"Metascore\":\"82\",\"imdbRating\":\"8.8\",\"imdbVotes\":\"1,225,276\",\"imdbID\":\"tt0109830\",\"Type\":\"movie\",\"Response\":\"True\"}"
Again the response is a JSON document. It’s a bit of a mess, right? We’ll clean it up and learn more about JSON shortly.
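Incidentally, rather than assembling the query string by hand you can let httr build it for you via the query argument. A minimal sketch (note that OMDb has since begun requiring a registered key, which would be supplied as an additional apikey parameter):
omdb = GET("http://www.omdbapi.com/",
           query = list(t = "Forrest Gump", plot = "short", r = "json"))
content(omdb, as = "text", encoding = "UTF-8")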
Interacting via POST is a little more complicated. We’ll demonstrate the basics with httpresponder. A GET request passes information as part of the URL. By contrast, a POST request sends data in the body of the HTTP request. Check out ?POST.
httpresponder = POST("http://httpresponder.com/ixdatascience.json", body = list(
  course = "Data Science",
  module = "Getting Data from an API"
))
content(httpresponder)
$students
[1] 15
$location
[1] "Cape Town"
The payload can take a variety of forms. Run the POST() command above and then visit httpresponder to view the request content as seen by the server.
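The encode argument to POST() controls how a list body is serialised. A brief sketch using the same test endpoint, assuming it is still available:
# Send the body as a URL-encoded form ...
POST("http://httpresponder.com/ixdatascience.json",
     body = list(course = "Data Science"), encode = "form")
# ... or as a JSON document in the request body.
POST("http://httpresponder.com/ixdatascience.json",
     body = list(course = "Data Science"), encode = "json")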
JSON (JavaScript Object Notation) is a text-based data-exchange format designed to be lightweight, easy for computers to parse and simple for humans to understand. It’s the de facto standard for the exchange of data over the interwebs.
JSON has six basic data types: strings, numbers, Booleans (true or false), null, objects (unordered collections of key/value pairs) and arrays (ordered lists of values).
White space is permitted but ignored. Generally JSON data are minified for transmission. This is the case for the examples that we have seen above.
It’s important to know a little about the JSON format. However, in most situations you’ll not be required to interact directly with JSON data: there are libraries to do the heavy lifting for you. In R the best options are the jsonlite, RJSONIO and rjson packages. The functionality in these packages is very similar. We’ll use jsonlite here but you should feel free to experiment with the others.
library(jsonlite)
First let’s convert the JSON document we retrieved from OMDb into a more human friendly format.
prettify(forrest.gump)
{
"Title": "Forrest Gump",
"Year": "1994",
"Rated": "PG-13",
"Released": "06 Jul 1994",
"Runtime": "142 min",
"Genre": "Drama, Romance",
"Director": "Robert Zemeckis",
"Writer": "Winston Groom (novel), Eric Roth (screenplay)",
"Actors": "Tom Hanks, Rebecca Williams, Sally Field, Michael Conner Humphreys",
"Plot": "Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him.",
"Language": "English",
"Country": "USA",
"Awards": "Won 6 Oscars. Another 37 wins & 51 nominations.",
"Poster": "http://ia.media-imdb.com/images/M/MV5BMTI1Nzk1MzQwMV5BMl5BanBnXkFtZTYwODkxOTA5._V1_SX300.jpg",
"Metascore": "82",
"imdbRating": "8.8",
"imdbVotes": "1,225,276",
"imdbID": "tt0109830",
"Type": "movie",
"Response": "True"
}
That’s a lot easier to digest. The document contains an object (key/value pairs). In this case all keys and values are strings.
The reverse operation can be done with minify(). Normally JSON is stored and transferred in an unformatted, minified state since this reduces the size of the document.
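For example, prettifying and then minifying the OMDb response gets us back to the compact form in which it arrived:
minify(prettify(forrest.gump))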
We can parse the contents of a JSON document with fromJSON(), which will convert it into an R data structure.
forrest.gump.parsed = fromJSON(forrest.gump)
class(forrest.gump.parsed)
[1] "list"
names(forrest.gump.parsed)
[1] "Title" "Year" "Rated" "Released" "Runtime" "Genre" "Director"
[8] "Writer" "Actors" "Plot" "Language" "Country" "Awards" "Poster"
[15] "Metascore" "imdbRating" "imdbVotes" "imdbID" "Type" "Response"
forrest.gump.parsed$Plot
[1] "Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran, eludes him."
The toJSON() function will convert an R object into a JSON document.
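A quick illustration with a small, made-up data frame; by default jsonlite converts a data frame to an array of objects, one per row.
people = data.frame(name = c("Alice", "Bob"), age = c(31, 25))
toJSON(people)                  # compact, row-wise JSON array
toJSON(people, pretty = TRUE)   # the same, with white space added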
The pair of functions unserializeJSON() and serializeJSON() perform a similar function, but provide greater fidelity for R objects. A serialised R object should capture essentially all aspects of the data, allowing it to be restored almost perfectly.
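The difference is easiest to see with an object that has attributes, like a Date vector. This is just a sketch: toJSON()/fromJSON() keep the values but drop the class, while serializeJSON()/unserializeJSON() preserve it.
dates = as.Date(c("2016-05-30", "2016-05-31"))
fromJSON(toJSON(dates))                  # character vector: the Date class is lost
unserializeJSON(serializeJSON(dates))    # still a Date vector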
A JSON document is an ideal way to store unstructured or semi-structured data. Let’s illustrate this with a simple exercise. Everybody needs to create a personal JSON document. It should have the following minimum information: first name, surname, gender, date of birth. Add to that three other informative fields. For example, it might look like this:
{
"name": "Eric",
"surname": "Blair",
"gender": "Male",
"birth": "25/06/1903",
"death": "21/01/1950",
"books": ["1984", "Animal Farm"],
"pen_name": "George Orwell"
}
We’ll consolidate the entries for the whole class into a single JSON document.
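One way to do that consolidation (a sketch, assuming each person’s entry has been saved as a separate .json file in the working directory):
# Parse each individual document into a list, then write the lot out as one JSON array.
people = lapply(list.files(pattern = "\\.json$"), fromJSON)
writeLines(toJSON(people, auto_unbox = TRUE, pretty = TRUE), "class.json")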
You often don’t need to have low level interactions with an API because there are many R packages which have wrapped these interactions up into functions. We’ll be focussing on two packages in particular (Quandl and twitteR), but here are some others that you should know about:
Rlinkedin
instaR (Instagram)
ROpenWeatherMap and rwunderground
rplos (search journals from the Public Library of Science)
translate (Google Translate)
WikidataR
datarobot (Predictive Modeling API)
telegram (Telegram Bot API)
gmailr
GuardianR (news from The Observer and The Guardian)
rdrop2 (Dropbox)

Quandl Package

The Quandl package is a wrapper around the Quandl API. Quandl is a vast repository of interesting data.
library(Quandl)
We’ll take a look at stock data for Apple. Visit the dashboard for these data on Quandl. Note that you can directly download the data in a range of formats.
AAPL = Quandl("WIKI/AAPL") # You can also supply an API key, but it's not mandatory
head(AAPL[, 1:5])
Date Open High Low Close
1 2016-05-27 99.44 100.47 99.245 100.35
2 2016-05-26 99.68 100.73 98.640 100.41
3 2016-05-25 98.67 99.74 98.110 99.62
4 2016-05-24 97.22 98.09 96.840 97.90
5 2016-05-23 95.87 97.19 95.670 96.43
6 2016-05-20 94.64 95.43 94.520 95.22
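Quandl() accepts further arguments too. For example (a sketch, assuming the WIKI/AAPL dataset is still being updated), you can restrict the date range:
aapl.2016 = Quandl("WIKI/AAPL", start_date = "2016-01-01", end_date = "2016-05-31")
range(aapl.2016$Date)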
twitteR Package

The twitteR package implements a high level interface to the Twitter API. It’s also worth mentioning streamR, which wraps the Twitter Streaming API.
library(twitteR)
The first step towards interacting with this API is to create an authorisation key. You do this by creating an application here.
Creating a new Twitter application.
Fill in a Name and Description. You can put in a placeholder like http://www.example.com for the Website (note that this needs to begin with http://). Leave the Callback URL empty. Submit the form.
On the following page go to the Keys and Access Tokens tab and make a note of the API Key and API Secret. Scroll down and create an Access Token. Make a note of the Access Token and Access Token Secret.
Twitter application keys and secrets.
Now that we’ve jumped through those hoops we can connect to the API.
consumer_key = "aYgTU4eYH3v4yzLwbpwFfkGAj"
consumer_secret = "JM126ueWzWNEDvamaGCS09WzijGI4ANcahxeyPujmxayS9gf0z"
access_token = "3320318445-12v4fKY0hUfbK84EpYnkBbojt3DyY6TEdqj6Tma"
access_secret = "MnMNiuMY7UGcH3f25gYzyV8rpqmlo98PbkaggGjMTI15j"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
You’ll see this message in the Console.
[1] "Using browser based authentication"
Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No
Selection:
Select 1.
You should now be ready to interact with the Twitter API. Let’s give it a test run: find 5 recent tweets mentioning “#datascience” originating from within 500 km of Durban.
(tweets <- searchTwitter('#datascience', geocode='-29.8579,31.0292,500km', n = 5))
[[1]]
[1] "DataWookie: @iXperienceCT #datascience Session 1 kicking off tomorrow! Going to be epic. #rstats #machinelearning @tableau"
[[2]]
[1] "7wData: Doing #datascience: A Kaggle Walkthrough Part 2 – Understanding the #Data\nhttps://t.co/BlqRaEdoFv https://t.co/NqrcebWErH"
[[3]]
[1] "7wData: #datascience set to transform wireless supply chain\nhttps://t.co/mR0KdhmkYZ https://t.co/k0r14TyyoW"
[[4]]
[1] "sentiwire: RT @rlnel: Cool interactive exploration tool for clustering algorithms and outliers | #DataScience #statistics | https://t.co/hDhMCJ1uOb"
[[5]]
[1] "rlnel: Conceptnet Numberbatch: The best word embeddings you can download #DataScience #statistics https://t.co/Pz6NjPGyOH"
Each of the returned items has class status. Check out ?status to find out about the functionality of this class.
tweets[[1]]$getText()
[1] "@iXperienceCT #datascience Session 1 kicking off tomorrow! Going to be epic. #rstats #machinelearning @tableau"
tweets[[1]]$getScreenName()
[1] "DataWookie"
You can turn each of the status objects into a data frame using as.data.frame(). These can then be concatenated to form a nice, tidy data set. You remember how to do this using do.call(), right?
names(as.data.frame(tweets[[1]]))
[1] "text" "favorited" "favoriteCount" "replyToSN" "created" "truncated" "replyToSID"
[8] "id" "replyToUID" "statusSource" "screenName" "retweetCount" "isRetweet" "retweeted"
[15] "longitude" "latitude"
You can extract user information. Look at ?user to find out about the user class.
hadley = getUser("hadleywickham")
hadley$name
[1] "Hadley Wickham"
hadley$description
[1] "R, data, visualisation."
hadley$location
[1] "Houston, TX"
And, of course, you can also tweet. The result has class status. Take a look at ?status to see associated functionality.
msg = tweet("#ixdatascience Let's get this party started! @iXperienceCT #rstats @tableau")
A status update straight from RStudio.
msg$id
[1] "737149648503201793"
msg$created
[1] "2016-05-30 05:12:26 UTC"
with(msg, list(retweeted, retweetCount))
[[1]]
[1] FALSE
[[2]]
[1] 0
That should be enough to get you started. We’ve really just touched the surface though. The twitteR package has excellent coverage of the Twitter API. You’re only limited by your imagination. Have fun! To get inspired, have a look at what other people have been doing with Twitter and R.
Identify the various data types in the following JSON document:
{
"firstName": "Jan",
"lastName": "van der Merwe",
"gender": male,
"height": 1.84,
"isAlive": true,
"age": 63,
"address": {
"streetAddress": "13 Burger Street",
"city": "Kakamas",
"state": "Northern Cape",
"country": "South Africa"
"postalCode": "8870"
},
"phoneNumbers": [
{
"type": "home",
"number": "054 365 2727"
},
{
"type": "mobile",
"number": "073 239 5730"
}
],
"children": [],
"spouse": null
}
Use fromJSON() to parse the contents of this JSON document.
Use the times field in the response to gather statistics on the distribution of DNS lookup and connection times.
Using rdrop2, write a script to do the following: … .RData file; …
library(ggmap)
data <- c("10001", "10002", "10003", "23112")                     # a few US ZIP codes
zip_list <- geocode(data, output = 'latlona', messaging = TRUE)   # longitude, latitude and address for each
zip_list$address