APIs

Topics: What are APIs; Why use APIs; How to access APIs

Many websites offer APIs for retrieving their content. If a website has an API, use it.

What is an API? (Application Programming Interface)

API is the acronym for Application Programming Interface, a software interface that allows two applications to talk to each other. In the case of web scraping, an API allows a web browser (e.g. Firefox or Chrome) or a programming language (e.g. Python, R, JavaScript) to talk to the database that stores the data on or collected by a webpage. The API allows you to read, display or “download” that data.

If a website has an API, use it. Generally, it won’t let you collect data that you’re not allowed to collect, and it’s by far the easiest way to collect a website’s data.

Check the following (by no means comprehensive) lists of websites with API access. You can use some of these to practice with.

API list

Public APIs

Guardian.co.uk API

Australian news sources API

How do we use an API?

You use URLs to collect a website’s data with an API; this is referred to as an API request. Certain sections of the API request can be manipulated to collect specific data. You will need to check the website’s API documentation for details.

Once you have formulated a suitable API request, you will need to GET the data (i.e. download it onto your computer). This is most commonly achieved using a programming language, although you can also paste the API request into your web browser and save the output. We generally use R or Python. If you are comfortable with either of these languages, check these guides.

r-project

Python API Tutorial
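Whatever the language, an API request URL generally has the same anatomy: a base endpoint, a relative path, and often some query parameters. A minimal sketch in Python (the endpoint and parameter names here are hypothetical, purely for illustration):

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://api.example.com"            # the API endpoint
relative_path = "/v1/articles"                  # which dataset you want
params = {"topic": "biodiversity", "page": 1}   # filters on that dataset

# Join the pieces into a full API request URL
api_request = f"{base_url}{relative_path}?{urlencode(params)}"
print(api_request)
```

Which sections exist and what they are called differs per API, so the website's documentation is always the authority on how to build the URL.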

Some commonly used websites, such as Twitter, have packages that can be used to interact with their API. Packages that use the Twitter API:

R: rtweet

python: Tweepy

Building an API request

First we need to formulate an API request. This can consist of the following:

Access token - Authentication

Many websites require an access token for you to GET their data with an API. These are usually free; you just need to sign up. Once you have an access token, you will need to put it into your API request. For example, you will need to set up a Twitter academic account to use their API.
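How the token is supplied varies between APIs (some take it as a query parameter, others as an HTTP header), so check the documentation. Below is a sketch of the common Authorization-header pattern using the requests package; the URL and token are placeholders, and the request is built without actually being sent:

```python
import requests

# Placeholder values, for illustration only
api_request = "https://api.example.com/v1/data"
access_token = "YOUR_ACCESS_TOKEN"

# Many APIs expect the token in an Authorization header
headers = {"Authorization": f"Bearer {access_token}"}

# Prepare the request without sending it, just to inspect what would be sent
prepared = requests.Request("GET", api_request, headers=headers).prepare()
print(prepared.url)
print(prepared.headers["Authorization"])
```

Sending it is then just `requests.get(api_request, headers=headers)`. Never commit your token to public code; treat it like a password.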

Let’s try it ourselves with a real example

NatureServe is a biodiversity conservation group that classifies, maps and sets biodiversity conservation goals. It holds a huge range of biodiversity data for the Americas. There are a few pieces of information we need to create a GET URL. This tutorial was developed using the Mozilla Firefox browser; for ease of use we suggest you use this browser for coding and development. Download Firefox here

We will GET information about the American black bear Ursus americanus.

All this information can be found in the website’s API documentation

NatureServe API endpoint:

Relative path for species data:

The id for a specific species:
The API documentation provides an example API request for the relative path/id “/api/data/taxon/ELEMENT_GLOBAL.2.154701”. From this we know that the API requires a species id that looks like “ELEMENT_GLOBAL.2.154701”. Let’s search for the American black bear on the NatureServe website and see if we can find an id.

NatureServe id for the American black bear

Put it all together and paste it into your web browser to display the data.
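“Putting it all together” just means concatenating the pieces: endpoint, relative path and species id. A small sketch in Python, using the black bear id found above:

```python
# Assemble the full API request from its parts
endpoint = "https://explorer.natureserve.org"   # NatureServe API endpoint
relative_path = "/api/data/taxon/"              # relative path for species data
species_id = "ELEMENT_GLOBAL.2.100661"          # id for the American black bear

api_request = f"{endpoint}{relative_path}{species_id}"
print(api_request)
```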

The first thing to do is test that your API request works, and that it GETs the correct data. To do this, paste the API request into your Firefox browser; it should display the data nicely. If you only want a couple of requests, you can save the data using the save button to download it.

NatureServe JSON data formatted in Firefox

If you downloaded this data, it saved as a JSON file (the most common format, though APIs can also return XML or HTML). The API documentation will tell you what format the data will be in. You can install a JSON viewer extension so that it is displayed nicely in the browser. If you want to read JSON data into Python, you can use the code below.

import json
import os

os.chdir('/Users/Downloads/') # You will need to change the directory to the one on your computer

with open('american_black_bear.json') as dat:
  data = json.load(dat)

If you have multiple API requests, or want to regularly collect the data (i.e. real-time updates), then you should use a programming language to GET the data.

This is JSON formatted data; you can also GET data from most APIs in XML and HTML format. You will most likely need to use a programming language to GET the data from the website’s database onto your computer (or any other location where you want to store/analyse the data).

Pull the data into python

import requests
import os
import json

api_request = 'https://explorer.natureserve.org/api/data/taxon/ELEMENT_GLOBAL.2.100661'
result = requests.get(api_request).json()

# Write the data to a json formatted text file on your computer
os.chdir('/Users/Documents/')
with open('American_black_bear.txt', 'w') as convert_file: 
     convert_file.write(json.dumps(result))
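As a quick sanity check that a file written this way round-trips correctly, here is a small sketch using a temporary directory instead of a hard-coded path (the sample dictionary stands in for a real API response):

```python
import json
import os
import tempfile

# A stand-in for a real API response, for illustration only
result = {"scientificName": "Ursus americanus", "records": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "American_black_bear.txt")

    # Write the data to a JSON formatted text file
    with open(path, "w") as convert_file:
        convert_file.write(json.dumps(result))

    # Read it back and confirm nothing was lost
    with open(path) as dat:
        data = json.load(dat)

print(data == result)
```

Using a temporary directory also avoids the hard-coded `os.chdir` path above, which will differ on every computer.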

Pull the data into R

require(httr)
require(rjson)

api_request <- 'https://explorer.natureserve.org/api/data/taxon/ELEMENT_GLOBAL.2.100661'

get_request <- httr::GET(api_request)
results <- httr::content(get_request, as='text')
results_parsed <- rjson::fromJSON(results)

Another Example with covid data

Let’s have a look at another API that provides global COVID-19 statistics: https://api.covid19api.com/ Let’s look at the documentation

Examples of API requests

Select Python as the language and requests as the package, and it will provide you with the code. Unfortunately it doesn’t show code for R.

The left column contains all the different datasets we can GET with API requests. Click on the default tab, then the “view more” button, to look at some examples of the relative paths for data we can get from this API.

API Documentation for the website 'https://api.covid19api.com/'
Example relative path to GET daily COVID cases for an individual country

This is the relative path from the example above: /dayone/country/:country/status/:status/live. This API request requires two ids: :country is the id for the target country and :status is the id for the target status.
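The :country and :status segments are placeholders that you replace with real ids. A small sketch of that substitution in Python, using the ids this example ends up with (canada and confirmed):

```python
base_url = "https://api.covid19api.com"
relative_path = "/dayone/country/:country/status/:status/live"

# Replace the placeholder ids with real values
country = "canada"    # the :country id
status = "confirmed"  # the :status id

api_request = base_url + relative_path.replace(":country", country).replace(":status", status)
print(api_request)
```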

Let’s use this relative path to get the infection data for Canada from the date of the first confirmed case.

We can get a list of countries by adding countries to the base URL, i.e. https://api.covid19api.com/countries. Paste the full URL “https://api.covid19api.com/countries” into your web browser, ideally Firefox (it will be far easier to read).

Use the find function (press “Ctrl+F” in the browser and a search bar will pop up) and search for Canada.

Finding the country id for Canada
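You can do the same search programmatically instead of with Ctrl+F. The sketch below filters a countries list for Canada; the sample data is hard-coded to mirror the shape of the /countries response (the field names Country, Slug and ISO2 are our assumption from the documentation), so it runs without a network call:

```python
# Sample of the /countries response; field names assumed from the API documentation
countries = [
    {"Country": "Australia", "Slug": "australia", "ISO2": "AU"},
    {"Country": "Canada", "Slug": "canada", "ISO2": "CA"},
    {"Country": "Chile", "Slug": "chile", "ISO2": "CL"},
]

# Keep only the entries whose Country name matches
matches = [c for c in countries if c["Country"] == "Canada"]
print(matches[0]["Slug"])
```

The Slug value is what goes into the :country position of the relative path.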

Our API request:

https://api.covid19api.com/dayone/country/canada/status/confirmed/live

import requests
import os
import json

api_request = 'https://api.covid19api.com/dayone/country/canada/status/confirmed/live'
result = requests.get(api_request).json()

# Write the data to a json formatted text file on your computer
os.chdir('/Users/Documents/')
with open('canada_covid_daily.txt', 'w') as convert_file: 
     convert_file.write(json.dumps(result))