Access Open Data Through the Junar API

Published on Monday, March 28, 2016 | last changed on Friday, May 13, 2016

Introduction

The Junar API is the basis for a number of Open Data initiatives in Latin America and the USA. The junr package is a wrapper that makes it easier to access data published through the Junar API. Example implementations include the City of Pasadena and the City of San Jose; others are listed on the Junar website.

The package has been published on CRAN and can be installed directly in R using:

install.packages("junr")

If you prefer the latest development version, you can find it on GitHub and install it with the devtools package, as described on the page github.com/fvd/junr.

While the Junar API is part of a commercial platform from Junar Inc., using the data in all the implementations mentioned above is free of charge for the user. Junar was designed to make it easier for organizations to open their data and to promote the use of open data. As a user you will need to create a new API Key for each collection of data sets you are interested in, but there is no cost associated with doing so.

Browsing data

As an example we will use the data from the Costa Rican President's Office. The first step is to visit the website offering the open data, identify the base URL, and obtain an API Key for the Junar API that hosts the data. You will find both on the developers page of the Open Data Costa Rica site.

Below we use a test API Key so that all the examples will run, although you may prefer to get your own API Key instead. Note that with Junar each URL has its own API Key.

library(junr)
base_url <- "http://api.datosabiertos.presidencia.go.cr/api/v2/datastreams/"
api_key <- "0bd55e858409eefabc629b28b2e7916361ef20ff"

Now that we have the basic information for a connection we can quickly check what data is available behind this URL.

get_index(base_url, api_key)

The get_index function returns the complete list of available data sets as a data frame, with all metadata included.

To get only a list of the Global Unique Identifiers (GUIDs) of the data sets, you can use list_guid.

list_guid(base_url, api_key)
##  [1] "COMPR-PUBLI-DEL-MINIS"      "COMPR-PUBLI-DE-PRESI"      
##  [3] "PLANI-DEL-MINIS"            "LICIT-ADJUD-POR-LOS-MINIS" 
##  [5] "LICIT-ADJUD-POR-LAS-INSTI"  "LICIT-ADJUD-DE-LAS-INSTI"  
##  [7] "LICIT-ADJUD-POR-LAS-81483"  "PLANI-DE-SALAR-MINIS-65188"
##  [9] "PLANI-DE-SALAR-PRESI-DE"    "INFOR-DE-HORAS-EXTRA-67320"
## [11] "INFOR-DE-HORAS-EXTRA-01"    "INFOR-DE-HORAS-EXTRA-7"    
## [13] "VISTA"                      "EJECU-DE-PRESU-DE-50724"   
## [15] "DATOS-CORRE-AL-PAGO-DE"     "DATOS-CORRE-AL-PAGO-32327" 
## [17] "DESCR-DE-ABREV-DE-LAS"      "EJECU-DE-PRESU-DE-INSTI"

You can also make a list of the titles of the data sets:

list_titles(base_url, api_key)
##  [1] "Compras públicas del Ministerio de la Presidencia"                               
##  [2] "Compras públicas de Presidencia"                                                 
##  [3] "Ministerio de la Presidencia"                                                    
##  [4] "Licitaciones adjudicadas por los Ministerios"                                    
##  [5] "Licitaciones adjudicadas por las Instituciones Públicas según año"               
##  [6] "Licitaciones Adjudicadas de las Instituciones Públicas para el período 2014-2015"
##  [7] "Licitaciones adjudicadas por las Instituciones Públicas según tipo de trámite"   
##  [8] "Abril 2016: Planilla de salarios: Ministerio de la Presidencia"                  
##  [9] "Abril 2016: Planilla de Salarios Presidencia de la República"                    
## [10] "Informe de Horas Extra: 01 de enero 2015 al 31 de diciembre de 2015"             
## [11] "Informe de Horas Extra: 01 de enero 2015 al 31 de diciembre de 2015"             
## [12] "Informe de Horas Extra: 7 de mayo 2014 al 31 de diciembre 2014."                 
## [13] "Informe de Horas Extra: 01 de enero 2016 al 30 de abril 2016"                    
## [14] "Ejecución de presupuesto de Instituciones para el 2014"                          
## [15] "Presidencia de la República"                                                     
## [16] "Datos correspondientes al pago de planilla del Ministerio"                       
## [17] "Descripción de abreviaturas de las ejecuciones "                                 
## [18] "Ejecución de presupuesto de Instituciones para el 2015"

Both list_guid and list_titles were set up for convenience only: their results tend to fit in the console window, which makes them easier to read. They are meant to give a quick overview of the available data.

Downloading data to R

You need to know the Global Unique Identifier (GUID) of the data set that you are interested in to be able to download it to your R session. You can look for the GUID on the web page that shows the data of interest. For example, on the page for public expenditure of the Costa Rican government there is a table called "Public Purchasing of the Ministry of the Presidency". The menu underneath the table has an option to "Obtain GUID". This option opens a pop-up showing the GUID "COMPR-PUBLI-DEL-MINIS", which we are going to use in the example below.

data_guid <- "COMPR-PUBLI-DEL-MINIS"
purchasing_data <- get_data(base_url, api_key, data_guid)

With View(purchasing_data) you can check whether the data have been downloaded correctly and get a quick visual check of the mode of the data (see below for converting currency data from text to numeric).
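A programmatic complement to the visual check is to list the class of every column. The sketch below uses a small stand-in data frame with hypothetical column names and made-up values, so that it runs without a network connection; with a real download you would pass purchasing_data instead.

```r
# Stand-in for a downloaded data set. The column names and values are
# hypothetical and only serve to illustrate the check.
example_data <- data.frame(
  Proveedor = c("Proveedor A", "Proveedor B"),
  Monto = c("1,200.00", "350.50"),
  stringsAsFactors = FALSE
)

# Report the class of each column. Currency columns that arrive as text
# show up as "character" rather than "numeric".
column_classes <- vapply(example_data, function(col) class(col)[1], character(1))
print(column_classes)

# Names of the columns that are still stored as text
text_columns <- names(column_classes)[column_classes == "character"]
print(text_columns)
```

Columns that hold amounts but are reported as character are the ones that would need cleaning before any arithmetic.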

Note that you do not need to go to the web interface to get the GUID of a data set of interest. The function list_guid(), as used above, returns the same information.

pres_list <- list_guid(base_url, api_key)
pres_list[3]
## [1] "PLANI-DEL-MINIS"

We can get the GUID we are interested in by fetching the third entry in the list of GUIDs (see the full list in the example above). The same index numbers can be used with a list of full titles created with list_titles().
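Because the two lists share index positions, a search in the titles can be turned into a GUID lookup. The following is a minimal sketch, assuming the positions returned by list_titles() and list_guid() line up as described. The two vectors are copied from the first six entries of the output shown earlier, so the example runs offline; in practice they would come from list_guid(base_url, api_key) and list_titles(base_url, api_key).

```r
# First six GUIDs and titles, copied from the console output shown earlier
guids <- c(
  "COMPR-PUBLI-DEL-MINIS", "COMPR-PUBLI-DE-PRESI", "PLANI-DEL-MINIS",
  "LICIT-ADJUD-POR-LOS-MINIS", "LICIT-ADJUD-POR-LAS-INSTI",
  "LICIT-ADJUD-DE-LAS-INSTI"
)
titles <- c(
  "Compras públicas del Ministerio de la Presidencia",
  "Compras públicas de Presidencia",
  "Ministerio de la Presidencia",
  "Licitaciones adjudicadas por los Ministerios",
  "Licitaciones adjudicadas por las Instituciones Públicas según año",
  "Licitaciones Adjudicadas de las Instituciones Públicas para el período 2014-2015"
)

# Positions of the titles that contain the search term
hits <- grep("Licitaciones", titles, fixed = TRUE)

# The same positions select the corresponding GUIDs
matching_guids <- guids[hits]
print(matching_guids)
```

Any of the resulting GUIDs can then be passed to get_data() in the same way as data_guid above.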

Determine data dimensions

On data platforms that run Junar, many data sets are just tables of data that have already been analyzed and summarized. It is not immediately obvious which sets contain many data points and which contain only a few rows.

The function get_dimensions downloads all data sets offered through the base URL and determines how many rows and columns are available in each one. It is useful for a quick assessment of the data available. However, please note that it may take a while before the function finishes, especially if there are many GUIDs.

get_dimensions(base_url, api_key)
##                          GUID   NROW NCOL    DIM
## 2       COMPR-PUBLI-DEL-MINIS    324    4   1296
## 21       COMPR-PUBLI-DE-PRESI    427    4   1708
## 3             PLANI-DEL-MINIS   5561    8  44488
## 4   LICIT-ADJUD-POR-LOS-MINIS     10    2     20
## 5   LICIT-ADJUD-POR-LAS-INSTI      3    2      6
## 6    LICIT-ADJUD-DE-LAS-INSTI 103471    7 724297
## 7   LICIT-ADJUD-POR-LAS-81483      7    2     14
## 8  PLANI-DE-SALAR-MINIS-65188   6070   14  84980
## 9     PLANI-DE-SALAR-PRESI-DE   3296   13  42848
## 10 INFOR-DE-HORAS-EXTRA-67320    386    7   2702
## 11    INFOR-DE-HORAS-EXTRA-01    386    7   2702
## 12     INFOR-DE-HORAS-EXTRA-7    182    7   1274
## 13                      VISTA    386    8   3088
## 14    EJECU-DE-PRESU-DE-50724   9249   40 369960
## 15     DATOS-CORRE-AL-PAGO-DE   2472   10  24720
## 16  DATOS-CORRE-AL-PAGO-32327   5561   10  55610
## 17      DESCR-DE-ABREV-DE-LAS     27    4    108
## 18    EJECU-DE-PRESU-DE-INSTI   8867   39 345813
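Once the dimensions are in a data frame, ordinary data frame operations help pick out the larger data sets. The sketch below is only an illustration: it rebuilds four rows from the output above by hand so that it runs offline, and the threshold min_rows is an arbitrary, hypothetical choice. With a live connection you would start from the result of get_dimensions(base_url, api_key) instead.

```r
# Four rows copied from the get_dimensions() output above
dims <- data.frame(
  GUID = c("COMPR-PUBLI-DEL-MINIS", "LICIT-ADJUD-POR-LOS-MINIS",
           "LICIT-ADJUD-DE-LAS-INSTI", "EJECU-DE-PRESU-DE-50724"),
  NROW = c(324, 10, 103471, 9249),
  NCOL = c(4, 2, 7, 40),
  stringsAsFactors = FALSE
)

# In the output above, DIM equals the number of rows times the number of columns
dims$DIM <- dims$NROW * dims$NCOL

# Sort so that the data sets with the most rows come first
dims_sorted <- dims[order(dims$NROW, decreasing = TRUE), ]

# Keep only the GUIDs of data sets with at least min_rows rows
min_rows <- 1000  # arbitrary cut-off for illustration
large_sets <- dims_sorted$GUID[dims_sorted$NROW >= min_rows]
print(large_sets)
```

Filtering this way separates the raw, row-level data sets from the small summary tables.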

Clean up currency data

In the example data above, and possibly in other Junar implementations, we need to clean up any data related to currency values. In our case we need to remove all currency symbols (Costa Rican colón) and all the commas separating thousands. As they stand these values are text strings, and they cannot be converted directly to numeric without removing the symbols and commas.

There are two utilities to help clean currency data: clean_currency and get_currency_symbol. For example:

currency_data <- get_data(base_url, api_key, "LICIT-ADJUD-POR-LOS-MINIS")
currency_data$`Monto Adjudicado` <- clean_currency(currency_data$`Monto Adjudicado`)
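To make the cleaning step concrete, here is a minimal base R sketch of the idea described above: remove the currency symbol and the thousands separators, then convert to numeric. The helper strip_currency is hypothetical and is not the implementation of clean_currency in junr, and the amounts are made up for illustration.

```r
# Hypothetical helper, not junr's own code: drop every character that is
# not a digit, a decimal point or a minus sign, then convert to numeric.
strip_currency <- function(x) {
  as.numeric(gsub("[^0-9.-]", "", x))
}

# Made-up amounts with the colón symbol (U+20A1) and comma separators
raw_amounts <- c("\u20a11,250,000.50", "\u20a1300")

amounts <- strip_currency(raw_amounts)
print(amounts)

# The cleaned values are numeric, so arithmetic now works
total <- sum(amounts)
print(total)
```

This sketch assumes that commas are used only as thousands separators and the period as the decimal mark, as in the data described above; data that use the comma as the decimal mark would need a different pattern.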

Acknowledgements and Notes