The jsonlite package is a JSON parser/generator for R which is optimized for pipelines and web APIs. It is used by the OpenCPU system and many other packages to get data in and out of R using the JSON format.
One of the main strengths of jsonlite is that it implements a bidirectional mapping between JSON and data frames. It can thereby convert nested collections of JSON records, as they often appear on the web, directly into the appropriate R structure. For example, to grab some data from ProPublica we can simply use:
library(jsonlite)
mydata <- fromJSON("https://projects.propublica.org/forensics/geos.json", flatten = TRUE)
View(mydata)
The mydata object is a data frame which can be used directly for modeling or visualization, without the need for any further complicated data manipulation.
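This bidirectional mapping also works on local data. A quick sanity check (mirroring the package quickstart) round-trips a data frame through JSON and back, which should return TRUE:
library(jsonlite)
all.equal(mtcars, fromJSON(toJSON(mtcars)))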
A question that comes up frequently is how to combine pages of data. Most web APIs limit the amount of data that can be retrieved per request. If the client needs more data than fits in a single request, it needs to break the data down into multiple requests that each retrieve a fragment (page) of the data, not unlike pages in a book. In practice this is often implemented using a page parameter in the API. Below is an example from the ProPublica Nonprofit Explorer API where we retrieve the first 3 pages of tax-exempt organizations in the USA, ordered by revenue:
baseurl <- "https://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
mydata0 <- fromJSON(paste0(baseurl, "&page=0"), flatten = TRUE)
mydata1 <- fromJSON(paste0(baseurl, "&page=1"), flatten = TRUE)
mydata2 <- fromJSON(paste0(baseurl, "&page=2"), flatten = TRUE)
#The actual data is in the filings element
mydata0$filings[1:10, c("organization.sub_name", "organization.city", "totrevenue")]
organization.sub_name organization.city totrevenue
1 KAISER FOUNDATION HEALTH PLAN INC OAKLAND 40148558254
2 KAISER FOUNDATION HEALTH PLAN INC OAKLAND 37786011714
3 KAISER FOUNDATION HOSPITALS OAKLAND 20796549014
4 KAISER FOUNDATION HOSPITALS OAKLAND 17980030355
5 PARTNERS HEALTHCARE SYSTEM INC SOMERVILLE 10619215354
6 UPMC PITTSBURGH 10098163008
7 UAW RETIREE MEDICAL BENEFITS TR DETROIT 9890722789
8 THRIVENT FINANCIAL FOR LUTHERANS MINNEAPOLIS 9475129863
9 THRIVENT FINANCIAL FOR LUTHERANS MINNEAPOLIS 9021585970
10 DIGNITY HEALTH SAN FRANCISCO 8718896265
To analyze or visualize these data, we need to combine the pages into a single dataset. We can do this with the rbind_pages function. Note that in this example, the actual data is contained in the filings field:
#Rows per data frame
nrow(mydata0$filings)
[1] 25
#Combine data frames
filings <- rbind_pages(
list(mydata0$filings, mydata1$filings, mydata2$filings)
)
#Total number of rows
nrow(filings)
[1] 75
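Conveniently, rbind_pages does not require that every page has exactly the same columns: fields missing from a page are filled with NA in the combined data frame. A minimal sketch with toy data frames (not from the API) illustrates this:
#pages with slightly different columns are still combined
df1 <- data.frame(title = c("A", "B"), year = c(2001, 2002))
df2 <- data.frame(title = "C", rating = 5)
rbind_pages(list(df1, df2))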
We can write a simple loop that automatically downloads and combines many pages. For example, to retrieve the first 21 pages (pages 0 through 20) of non-profits from the example above:
#store all pages in a list first
baseurl <- "https://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
pages <- list()
for(i in 0:20){
  mydata <- fromJSON(paste0(baseurl, "&page=", i))
  message("Retrieving page ", i)
  pages[[i+1]] <- mydata$filings
}
#combine all into one
filings <- rbind_pages(pages)
#check output
nrow(filings)
[1] 525
colnames(filings)
[1] "tax_prd" "tax_prd_yr"
[3] "formtype" "pdf_url"
[5] "updated" "totrevenue"
[7] "totfuncexpns" "totassetsend"
[9] "totliabend" "pct_compnsatncurrofcr"
[11] "tax_pd" "subseccd"
[13] "unrelbusinccd" "initiationfees"
[15] "grsrcptspublicuse" "grsincmembers"
[17] "grsincother" "totcntrbgfts"
[19] "totprgmrevnue" "invstmntinc"
[21] "txexmptbndsproceeds" "royaltsinc"
[23] "grsrntsreal" "grsrntsprsnl"
[25] "rntlexpnsreal" "rntlexpnsprsnl"
[27] "rntlincreal" "rntlincprsnl"
[29] "netrntlinc" "grsalesecur"
[31] "grsalesothr" "cstbasisecur"
[33] "cstbasisothr" "gnlsecur"
[35] "gnlsothr" "netgnls"
[37] "grsincfndrsng" "lessdirfndrsng"
[39] "netincfndrsng" "grsincgaming"
[41] "lessdirgaming" "netincgaming"
[43] "grsalesinvent" "lesscstofgoods"
[45] "netincsales" "miscrevtot11e"
[47] "compnsatncurrofcr" "othrsalwages"
[49] "payrolltx" "profndraising"
[51] "txexmptbndsend" "secrdmrtgsend"
[53] "unsecurednotesend" "retainedearnend"
[55] "totnetassetend" "nonpfrea"
[57] "gftgrntsrcvd170" "txrevnuelevied170"
[59] "srvcsval170" "grsinc170"
[61] "grsrcptsrelated170" "totgftgrntrcvd509"
[63] "grsrcptsadmissn509" "txrevnuelevied509"
[65] "srvcsval509" "subtotsuppinc509"
[67] "totsupp509" "ein"
[69] "organization" "eostatus"
[71] "tax_yr" "operatingcd"
[73] "assetcdgen" "transinccd"
[75] "subcd" "grscontrgifts"
[77] "intrstrvnue" "dividndsamt"
[79] "totexcapgn" "totexcapls"
[81] "grsprofitbus" "otherincamt"
[83] "compofficers" "contrpdpbks"
[85] "totrcptperbks" "totexpnspbks"
[87] "excessrcpts" "totexpnsexempt"
[89] "netinvstinc" "totaxpyr"
[91] "adjnetinc" "invstgovtoblig"
[93] "invstcorpstk" "invstcorpbnd"
[95] "totinvstsec" "fairmrktvalamt"
[97] "undistribincyr" "cmpmininvstret"
[99] "sec4940notxcd" "sec4940redtxcd"
[101] "infleg" "contractncd"
[103] "claimstatcd" "propexchcd"
[105] "brwlndmnycd" "furngoodscd"
[107] "paidcmpncd" "trnsothasstscd"
[109] "agremkpaycd" "undistrinccd"
[111] "dirindirintcd" "invstjexmptcd"
[113] "propgndacd" "excesshldcd"
[115] "grntindivcd" "nchrtygrntcd"
[117] "nreligiouscd" "grsrents"
[119] "costsold" "totrcptnetinc"
[121] "trcptadjnetinc" "topradmnexpnsa"
[123] "topradmnexpnsb" "topradmnexpnsd"
[125] "totexpnsnetinc" "totexpnsadjnet"
[127] "othrcashamt" "mrtgloans"
[129] "othrinvstend" "fairmrktvaleoy"
[131] "mrtgnotespay" "tfundnworth"
[133] "invstexcisetx" "sect511tx"
[135] "subtitleatx" "esttaxcr"
[137] "txwithldsrc" "txpaidf2758"
[139] "erronbkupwthld" "estpnlty"
[141] "balduopt" "crelamt"
[143] "tfairmrktunuse" "distribamt"
[145] "adjnetinccola" "adjnetinccolb"
[147] "adjnetinccolc" "adjnetinccold"
[149] "adjnetinctot" "qlfydistriba"
[151] "qlfydistribb" "qlfydistribc"
[153] "qlfydistribd" "qlfydistribtot"
[155] "valassetscola" "valassetscolb"
[157] "valassetscolc" "valassetscold"
[159] "valassetstot" "qlfyasseta"
[161] "qlfyassetb" "qlfyassetc"
[163] "qlfyassetd" "qlfyassettot"
[165] "endwmntscola" "endwmntscolb"
[167] "endwmntscolc" "endwmntscold"
[169] "endwmntstot" "totsuprtcola"
[171] "totsuprtcolb" "totsuprtcolc"
[173] "totsuprtcold" "totsuprttot"
[175] "pubsuprtcola" "pubsuprtcolb"
[177] "pubsuprtcolc" "pubsuprtcold"
[179] "pubsuprttot" "grsinvstinca"
[181] "grsinvstincb" "grsinvstincc"
[183] "grsinvstincd" "grsinvstinctot"
From here, we can go straight to analyzing the filings data without any further tedious data manipulation.
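If the total number of pages is not known upfront, the same loop can instead keep requesting pages until the API runs out of results. A sketch of this pattern, assuming the filings element is empty or absent on the first page past the end:
baseurl <- "https://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
pages <- list()
i <- 0
repeat {
  mydata <- fromJSON(paste0(baseurl, "&page=", i))
  #stop when this page carries no more filings
  if(is.null(mydata$filings) || nrow(mydata$filings) == 0) break
  pages[[i + 1]] <- mydata$filings
  i <- i + 1
}
filings <- rbind_pages(pages)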
This section lists some examples of public HTTP APIs that publish data in JSON format. These are great to get a sense of the complex structures that are encountered in real-world JSON data. All services are free, but some require registration/authentication. Each example returns lots of data, so not all output is printed in this document.
library(jsonlite)
GitHub is an online code repository and has APIs to get live data on almost all activity. Below are some examples from a well-known R package and author:
hadley_orgs <- fromJSON("https://api.github.com/users/hadley/orgs")
hadley_repos <- fromJSON("https://api.github.com/users/hadley/repos")
gg_commits <- fromJSON("https://api.github.com/repos/hadley/ggplot2/commits")
gg_issues <- fromJSON("https://api.github.com/repos/hadley/ggplot2/issues")
#latest issues
paste(format(gg_issues$user$login), ":", gg_issues$title)
[1] "jsta : fix broken stowers link"
[2] "krlmlr : Log transform on geom_bar() silently omits layer"
[3] "yutannihilation : Fix a broken link in README"
[4] "raubreywhite : Fix theme_gray's legend/panels for large base_size"
[5] "batuff : Add minor ticks to axes"
[6] "mcol : overlapping boxes with geom_boxplot(varwidth=TRUE)"
[7] "karawoo : Fix density calculations for groups with one or two elements"
[8] "Thieffen : fix typo"
[9] "Thieffen : fix typo"
[10] "thjwong : `axis.line` works, but not `axis.line.x` and `axis.line.y`"
[11] "schloerke : scale_discrete not listening to 'breaks' arg"
[12] "hadley : Consider use of vwline"
[13] "JTapper : geom_polygon accessing data$y"
[14] "Ax3man : Added linejoin parameter to geom_segment."
[15] "LSanselme : geom_density with groups of 1 or 2 elements"
[16] "philstraforelli : (feature request) Changing facet_wrap strip colour based on variable in data frame"
[17] "eliocamp : geom_tile() + coord_map() is extremely slow."
[18] "eliocamp : facet_wrap() doesn't play well with expressions in facets. "
[19] "dantonnoriega : Request: Quick visual example for each geom at http://ggplot2.tidyverse.org/reference/"
[20] "randomgambit : it would be nice to have date_breaks('0.2 sec')"
[21] "adrfantini : Labels can overlap in coord_sf()"
[22] "adrfantini : borders() is incompatible with coord_sf() with projected coordinates"
[23] "adrfantini : coord_proj() is superior to coord_map() and could be included in the default ggplot"
[24] "adrfantini : Coordinates labels and gridlines are wrong in coord_map()"
[25] "jonocarroll : Minor typo: monotonous -> monotonic"
[26] "FabianRoger : label.size in geom_label is ignored when printing to pdf"
[27] "andrewdolman : Add note recommending annotate"
[28] "Henrik-P : scale_identity doesn't play well with guide = \"legend\""
[29] "cpsievert : stat_sf(geom = \"text\")"
[30] "hadley : Automatically fill in x for univariate boxplot"
A single public API that shows location, status and current availability for all stations in the New York City bike sharing initiative.
citibike <- fromJSON("http://citibikenyc.com/stations/json")
stations <- citibike$stationBeanList
colnames(stations)
[1] "id" "stationName"
[3] "availableDocks" "totalDocks"
[5] "latitude" "longitude"
[7] "statusValue" "statusKey"
[9] "availableBikes" "stAddress1"
[11] "stAddress2" "city"
[13] "postalCode" "location"
[15] "altitude" "testStation"
[17] "lastCommunicationTime" "landMark"
nrow(stations)
[1] 666
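Because stations is a plain data frame, the availability columns listed above combine directly. For example, a quick sketch of the fraction of docks currently holding a bike:
#fraction of docks that currently hold a bike
stations$pct_avail <- stations$availableBikes / stations$totalDocks
stations[1:5, c("stationName", "availableBikes", "totalDocks", "pct_avail")]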
The Ergast Developer API is an experimental web service which provides a historical record of motor racing data for non-commercial purposes.
res <- fromJSON('http://ergast.com/api/f1/2004/1/results.json')
drivers <- res$MRData$RaceTable$Races$Results[[1]]$Driver
colnames(drivers)
[1] "driverId" "code" "url" "givenName"
[5] "familyName" "dateOfBirth" "nationality" "permanentNumber"
drivers[1:10, c("givenName", "familyName", "code", "nationality")]
givenName familyName code nationality
1 Michael Schumacher MSC German
2 Rubens Barrichello BAR Brazilian
3 Fernando Alonso ALO Spanish
4 Ralf Schumacher SCH German
5 Juan Pablo Montoya MON Colombian
6 Jenson Button BUT British
7 Jarno Trulli TRU Italian
8 David Coulthard COU British
9 Takuma Sato SAT Japanese
10 Giancarlo Fisichella FIS Italian
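As in the other examples, the simplified drivers data frame works directly with base R tools, for instance to tally drivers by nationality (a quick sketch):
#number of drivers per nationality
sort(table(drivers$nationality), decreasing = TRUE)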
Below is an example from the ProPublica Nonprofit Explorer API where we retrieve the first 11 pages (pages 0 through 10) of tax-exempt organizations in the USA, ordered by revenue. The rbind_pages function is used to combine the pages into a single data frame.
#store all pages in a list first
baseurl <- "https://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
pages <- list()
for(i in 0:10){
  mydata <- fromJSON(paste0(baseurl, "&page=", i), flatten=TRUE)
  message("Retrieving page ", i)
  pages[[i+1]] <- mydata$filings
}
#combine all into one
filings <- rbind_pages(pages)
#check output
nrow(filings)
[1] 275
filings[1:10, c("organization.sub_name", "organization.city", "totrevenue")]
organization.sub_name organization.city totrevenue
1 KAISER FOUNDATION HEALTH PLAN INC OAKLAND 40148558254
2 KAISER FOUNDATION HEALTH PLAN INC OAKLAND 37786011714
3 KAISER FOUNDATION HOSPITALS OAKLAND 20796549014
4 KAISER FOUNDATION HOSPITALS OAKLAND 17980030355
5 PARTNERS HEALTHCARE SYSTEM INC SOMERVILLE 10619215354
6 UPMC PITTSBURGH 10098163008
7 UAW RETIREE MEDICAL BENEFITS TR DETROIT 9890722789
8 THRIVENT FINANCIAL FOR LUTHERANS MINNEAPOLIS 9475129863
9 THRIVENT FINANCIAL FOR LUTHERANS MINNEAPOLIS 9021585970
10 DIGNITY HEALTH SAN FRANCISCO 8718896265
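The flatten = TRUE argument used here is what expands the nested organization record into columns such as organization.sub_name and organization.city. A minimal standalone illustration (with made-up JSON, not actual API output):
json <- '[{"totrevenue": 100, "organization": {"sub_name": "EXAMPLE INC", "city": "OAKLAND"}}]'
#without flattening, organization is a nested data frame column
fromJSON(json)
#with flattening, nested fields become organization.sub_name etc
fromJSON(json, flatten = TRUE)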
The New York Times has several APIs as part of the NYT developer network. These interface with data from various departments, such as news articles, book reviews, real estate, etc. Registration is required (but free) and a key can be obtained from the NYT developer network. The code below includes some example keys for illustration purposes.
#search for articles
article_key <- "&api-key=b75da00e12d54774a2d362adddcc9bef"
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=obamacare+socialism"
req <- fromJSON(paste0(url, article_key))
articles <- req$response$docs
colnames(articles)
[1] "web_url" "snippet" "lead_paragraph"
[4] "abstract" "print_page" "blog"
[7] "source" "multimedia" "headline"
[10] "keywords" "pub_date" "document_type"
[13] "news_desk" "section_name" "subsection_name"
[16] "byline" "type_of_material" "_id"
[19] "word_count" "slideshow_credits"
#search for best sellers
books_key <- "&api-key=76363c9e70bc401bac1e6ad88b13bd1d"
url <- "http://api.nytimes.com/svc/books/v2/lists/overview.json?published_date=2013-01-01"
req <- fromJSON(paste0(url, books_key))
bestsellers <- req$results$list
category1 <- bestsellers[[1, "books"]]
subset(category1, select = c("author", "title", "publisher"))
author title publisher
1 Gillian Flynn GONE GIRL Crown Publishing
2 John Grisham THE RACKETEER Knopf Doubleday Publishing
3 E L James FIFTY SHADES OF GREY Knopf Doubleday Publishing
4 Nicholas Sparks SAFE HAVEN Grand Central Publishing
5 David Baldacci THE FORGOTTEN Grand Central Publishing
#movie reviews
movie_key <- "&api-key=b75da00e12d54774a2d362adddcc9bef"
url <- "http://api.nytimes.com/svc/movies/v2/reviews/dvd-picks.json?order=by-date"
req <- fromJSON(paste0(url, movie_key))
reviews <- req$results
colnames(reviews)
[1] "display_title" "mpaa_rating" "critics_pick"
[4] "byline" "headline" "summary_short"
[7] "publication_date" "opening_date" "date_updated"
[10] "link" "multimedia"
reviews[1:5, c("display_title", "byline", "mpaa_rating")]
display_title byline mpaa_rating
1 Hermia & Helena GLENN KENNY
2 The Women's Balcony NICOLE HERRINGTON
3 Long Strange Trip DANIEL M. GOLD R
4 Joshua: Teenager vs. Superpower KEN JAWOROWSKI
5 Berlin Syndrome GLENN KENNY R
The Sunlight Foundation is a non-profit that helps to make government transparent and accountable through data, tools, policy and journalism. A free key can be registered on their website; an example key is provided below.
key <- "&apikey=39c83d5a4acc42be993ee637e2e4ba3d"
#Find bills about drones
drone_bills <- fromJSON(paste0("http://openstates.org/api/v1/bills/?q=drone", key))
drone_bills$title <- substring(drone_bills$title, 1, 40)
print(drone_bills[1:5, c("title", "state", "chamber", "type")])
title state chamber type
1 AIRPORT AUTHORITIES-DRONES il upper bill
2 Study Drone Use By Public Safety Agencie co lower bill
3 AIRCRAFT/AVIATION: Provides for the exc    la   upper bill
4 relative to the use of drones. nh lower bill
5 Use or Operation of a Drone by Certain O fl lower bill
#Local legislators
legislators <- fromJSON(paste0("http://congress.api.sunlightfoundation.com/",
"legislators/locate?latitude=42.96&longitude=-108.09", key))
subset(legislators$results, select=c("last_name", "chamber", "term_start", "twitter_id"))
last_name chamber term_start twitter_id
1 Cheney house 2017-01-03 RepLizCheney
2 Enzi senate 2015-01-06 SenatorEnzi
3 Barrasso senate 2013-01-03 SenJohnBarrasso
The Twitter API requires OAuth2 authentication. Some example code:
#Create your own application key at https://dev.twitter.com/apps
consumer_key = "EZRy5JzOH2QQmVAe9B4j2w";
consumer_secret = "OIDC4MdfZJ82nbwpZfoUO4WOLTYjoRhpHRAWj6JMec";
#Use basic auth
secret <- jsonlite::base64_enc(paste(consumer_key, consumer_secret, sep = ":"))
req <- httr::POST("https://api.twitter.com/oauth2/token",
httr::add_headers(
"Authorization" = paste("Basic", gsub("\n", "", secret)),
"Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"
),
body = "grant_type=client_credentials"
);
#Extract the access token
httr::stop_for_status(req, "authenticate with twitter")
token <- paste("Bearer", httr::content(req)$access_token)
#Actual API call
url <- "https://api.twitter.com/1.1/statuses/user_timeline.json?count=10&screen_name=Rbloggers"
req <- httr::GET(url, httr::add_headers(Authorization = token))
json <- httr::content(req, as = "text")
tweets <- fromJSON(json)
substring(tweets$text, 1, 100)
[1] "simmer 3.6.2 https://t.co/rRxgY2Ypfa #rstats #DataScience"
[2] "Getting data for every Census tract in the US with purrr and tidycensus https://t.co/B3NYJS8sLO #rst"
[3] "Gender Roles with Text Mining and N-grams https://t.co/Rwj0IaTiAR #rstats #DataScience"
[4] "Data Science Podcasts https://t.co/SaAuO82a7M #rstats #DataScience"
[5] "Reflections on ROpenSci Unconference 2017 https://t.co/87kMldvrsd #rstats #DataScience"
[6] "Summarizing big data in R https://t.co/GMaZZ9sWiL #rstats #DataScience"
[7] "Mining CRAN DESCRIPTION Files https://t.co/gWEIAYaBZF #rstats #DataScience"
[8] "New package polypoly (helper functions for orthogonal polynomials) https://t.co/MzzzcIySym #rstats #"
[9] "Hospital Infection Scores – R Shiny App https://t.co/Rf8wKNBPU6 #rstats #DataScience"
[10] "New R job: Software Engineer in Test for RStudio https://t.co/X1bWkKlzYv #rstats #DataScience #jobs"