--- title: "Introduction to tidyjson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to tidyjson} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(tidyjson) ``` [JSON](http://json.org/) (JavaScript Object Notation) is a lightweight and flexible data format that is easy for humans to read and for machines to parse. JSON has become a common format used in: * Public APIs (e.g., [Twitter](https://dev.twitter.com/rest/public)) * NoSQL databases as a document format (e.g., [MongoDB](https://www.mongodb.com/)) * Relational databases as a new column type (e.g., [PostgreSQL](http://www.postgresql.org/docs/9.4/static/datatype-json.html)) Tidyjson provides a grammar for turning complex JSON data into tidy data frames that are easy to work with in the [tidyverse](https://github.com/tidyverse/tidyverse). ## Why use tidyjson? Several libraries exist for working with JSON data in R, such as [rjson](https://cran.r-project.org/package=rjson), [rjsonio](https://cran.r-project.org/package=RJSONIO) and [jsonlite](https://cran.r-project.org/package=jsonlite). These libraries transform JSON data automatically into nested R lists or complex data frames. However, working with these complex objects can be difficult. The tidyjson package takes a different approach to structuring JSON data into tidy data frames. Similar to [tidyr](https://cran.r-project.org/package=tidyr), tidyjson builds a grammar for manipulating JSON into a tidy table structure. Tidyjson is based on the following principles: * Work on a single JSON document, or on a collection of related documents * Create pipelines with `%>%`, producing code that can be read from left to right * Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of `spread_all`) * Work with arbitrarily nested arrays or objects * Handle 'ragged' arrays and / or objects (varying lengths by document) * Allow for extraction of data in values or object names * Ensure edge cases are handled correctly (especially empty data) * Integrate seamlessly with `dplyr`, allowing `tbl_json` objects to pipe in and out of `dplyr` verbs where reasonable ## A simple example A simple example of how tidyjson works is as follows: ```{r, message = FALSE} library(dplyr) # Define a simple people JSON collection people <- c('{"age": 32, "name": {"first": "Bob", "last": "Smith"}}', '{"age": 54, "name": {"first": "Susan", "last": "Doe"}}', '{"age": 18, "name": {"first": "Ann", "last": "Jones"}}') # Tidy the JSON data people %>% spread_all ``` This produces a `tbl_json` object, where each row corresponds to an element of the `people` vector (a "document" in tidyjson). The JSON attribute of the `tbl_json` object is shown first, then the columns of the tibble are shown - a `document.id` indicating which document the row originated in, and then the age and name columns that `spread_all` created. ## A more complex example The tidyjson package really shines in a more complex example. Consider the `worldbank` data included in the tidyjson package. ```{r} worldbank %>% str ``` It is a `r length(worldbank)` length character vector of projects funded by the world bank. We can quickly expand all simple columns using `spread_all` ```{r} worldbank %>% spread_all ``` And we can limit the coluns produced by calling `dplyr::select` after ```{r} worldbank %>% spread_all %>% select(regionname, totalamt) ``` But worldbank also contains arrays, which cannot be naively spread into new columns. We can use `gather_object` to gather all name-value paris by name, and then `json_types` to identify the type of JSON stored under each value, and `dplyr::count` to aggregate across documents: ```{r, echo = FALSE, message = FALSE} options(tibble.print_min = 10L, tibble.print_max = 10L) ``` ```{r} worldbank %>% gather_object %>% json_types %>% count(name, type) ``` ```{r, echo = FALSE, message = FALSE} options(tibble.print_min = 4L, tibble.print_max = 4L) ``` It appears that `majorsector_percent` is an array, and so we can use `enter_object` to enter into it: ```{r} worldbank %>% enter_object(majorsector_percent) ``` and `gather_array` to gather it by index ```{r} worldbank %>% enter_object(majorsector_percent) %>% gather_array ``` We can then `spread_all` again to capture the name-value pairs stored in each object ```{r} worldbank %>% enter_object(majorsector_percent) %>% gather_array %>% spread_all ``` By combining with our initial top-level `spread_all`, we can aggregate funding dollars by sector by region: ```{r} worldbank %>% spread_all %>% select(region = regionname, funding = totalamt) %>% enter_object(majorsector_percent) %>% gather_array %>% spread_all %>% rename(sector = Name, percent = Percent) %>% group_by(region, sector) %>% summarize(funding = sum(funding * percent)) ``` ## Tidyjson functions Tidyjson provides the following core functions: | Function | Use | Row Effect | Column Effect | JSON Effect | |:------------------|:-----------------------------|:---------------------|:------------------|:-------------| | `spread_all` | Spread all object values | | add many columns | | | `spread_values` | Spread specific columns | | specific columns | | | `gather_array` | Gather a JSON array | Duplicates rows | index column | enter array | | `gather_object` | Gather a JSON object | Duplicates rows | name column | enter object | | `append_values_X` | Append all values of a type | | column of type X | | | `enter_object` | Enter into an object by name | Keeps rows with name | | enter object | | `json_types` | Identify JSON type | | type column | | | `json_lengths` | Identify JSON length | | length column | | These functions can be composed into pipelines using the `%>%` operator, as each takes a `tbl_json` object and returns a `tbl_json` object. ## Common patterns The following are common patterns for manipulating JSON data with tidyjson. ### Spread selected Spread all object values and select a subset to continue working with. `spread_all %>% select` ```{r} worldbank %>% spread_all %>% select(regionname, totalamt) ``` ### Object structure Identify the names under an object, and their type and frequency. `gather_object %>% json_types %>% count(name, type)` ```{r} worldbank %>% gather_object %>% json_types %>% count(name, type) ``` ### Gather nested array Enter into an array nested under an object, and gather it `enter_object %>% gather_array` ```{r} worldbank %>% enter_object(majorsector_percent) %>% gather_array ``` ### Filter for a specific type Filter to just objects / arrays and then gather them `filter(is_json_X(.)) %>% gather_X` ```{r} companies[1] %>% gather_object %>% filter(is_json_array(.)) %>% gather_array ``` ```{r} companies[1] %>% gather_object %>% filter(is_json_object(.)) %>% gather_object ``` ### Gather and append Useful when data is stored in object names as well as values `gather_object %>% append_values_X` ```{r} json <- '{"2015": 5, "2016": 10}' json %>% gather_object("year") %>% append_values_number("count") ``` ## Getting started with JSON data The first step in using tidyjson is to get your data into a `tbl_json` object. All tidyjson functions automatically coerce objects into `tbl_json` if they are not already, so you may be able to just start manipulating your data directly. But if not, you can use `as.tbl_json` directly. Here are examples for common scenarios: ### Character vector The simplest case is when your JSON data is already in R as a character vector, like the `worldbank` data: ```{r} worldbank %>% as.tbl_json ``` If this generates errors, then likely your JSON data is malformed: ```{r, error = TRUE} bad_json <- '{"key": "value"' bad_json %>% as.tbl_json ``` tidyjson uses `jsonlite::fromJSON` to parse the JSON, and so will print out a useful error message. ### Single array Many APIs will return multiple documents in a single array, like the `issues` data from github. ```{r} issues %>% as.tbl_json ``` A single call to gather_array makes this data look like the `worldbank` data: ```{r} issues %>% as.tbl_json %>% gather_array ``` ### List If your JSON is a list of character strings, you can use `purrr::flatten_chr' to flatten it into a character vector and then proceed as usual: ```{r} library(purrr) list('1', '2') %>% flatten_chr %>% as.tbl_json ``` ### Data frame If you extracted JSON from a table in a database into a data frame, then likely you already have other columns in the data frame you would like to retain. You can use then `json.column` argument to `as.tbl_json` to specify which column contains the JSON of interest: ```{r} df <- tibble(id = 1:2, json = list('[1, 2]', '[3, 4]')) df %>% as.tbl_json(json.column = "json") ``` ### File If your JSON is in a file, like in the [jsonlines](https://jsonlines.org/) format, then you can use `read_json` to read the file into a `tbl_json` object directly. ### JSON included in the package The tidyjson package comes with several JSON example datasets: * `commits`: commit data for the [dplyr](https://github.com/tidyverse/dplyr) repo from github API * `issues`: issue data for the [dplyr](https://github.com/tidyverse/dplyr) repo from github API * `worldbank`: world bank funded projects from `jsonstudio` * `companies`: startup company data from `jsonstudio` Each dataset has some example tidyjson queries in `help(commits)`, `help(issues)`, `help(worldbank)` and `help(companies)`.