Introduction to tidyjson

JSON (JavaScript Object Notation) is a lightweight and flexible data format that is easy for humans to read and for machines to parse. JSON has become a common format used in:

Public APIs (e.g., Twitter)
NoSQL databases as a document format (e.g., MongoDB)
Relational databases as a new column type (e.g., PostgreSQL)

Tidyjson provides a grammar for turning complex JSON data into tidy data frames that are easy to work with in the tidyverse.

Why use tidyjson?

Several libraries exist for working with JSON data in R, such as rjson, rjsonio and jsonlite. These libraries transform JSON data automatically into nested R lists or complex data frames. However, working with these complex objects can be difficult.

The tidyjson package takes a different approach to structuring JSON data into tidy data frames. Similar to tidyr, tidyjson builds a grammar for manipulating JSON into a tidy table structure.

Tidyjson is based on the following principles:

Work on a single JSON document, or on a collection of related documents
Create pipelines with %>%, producing code that can be read from left to right
Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of spread_all)
Work with arbitrarily nested arrays or objects
Handle ‘ragged’ arrays and / or objects (varying lengths by document)
Allow for extraction of data in values or object names
Ensure edge cases are handled correctly (especially empty data)
Integrate seamlessly with dplyr, allowing tbl_json objects to pipe in and out of dplyr verbs where reasonable

A simple example

A simple example of how tidyjson works is as follows:

library(dplyr)

# Define a simple people JSON collection
people <- c('{"age": 32, "name": {"first": "Bob",   "last": "Smith"}}',
            '{"age": 54, "name": {"first": "Susan", "last": "Doe"}}',
            '{"age": 18, "name": {"first": "Ann",   "last": "Jones"}}')

# Tidy the JSON data
people %>% spread_all
#> # A tbl_json: 3 x 5 tibble with a "JSON" attribute
#>   ..JSON                  document.id   age name.first name.last
#>   <chr>                         <int> <dbl> <chr>      <chr>    
#> 1 "{\"age\":32,\"name..."           1    32 Bob        Smith    
#> 2 "{\"age\":54,\"name..."           2    54 Susan      Doe      
#> 3 "{\"age\":18,\"name..."           3    18 Ann        Jones

This produces a tbl_json object, where each row corresponds to an element of the people vector (a “document” in tidyjson). The JSON attribute of the tbl_json object is shown first, then the columns of the tibble are shown - a document.id indicating which document the row originated in, and then the age and name columns that spread_all created.

A more complex example

The tidyjson package really shines in a more complex example. Consider the worldbank data included in the tidyjson package.

worldbank %>% str
#>  chr [1:500] "{\"_id\":{\"$oid\":\"52b213b38594d8a2be17c780\"},\"boardapprovaldate\":\"2013-11-12T00:00:00Z\",\"closingdate\""| __truncated__ ...

It is a 500 length character vector of projects funded by the world bank. We can quickly expand all simple columns using spread_all

worldbank %>% spread_all
#> # A tbl_json: 500 x 9 tibble with a "JSON" attribute
#>   ..JSON document.id boardapprovaldate closingdate countryshortname project_name
#>   <chr>        <int> <chr>             <chr>       <chr>            <chr>       
#> 1 "{\"_…           1 2013-11-12T00:00… 2018-07-07… Ethiopia         Ethiopia Ge…
#> 2 "{\"_…           2 2013-11-04T00:00… <NA>        Tunisia          TN: DTF Soc…
#> 3 "{\"_…           3 2013-11-01T00:00… <NA>        Tuvalu           Tuvalu Avia…
#> 4 "{\"_…           4 2013-10-31T00:00… <NA>        Yemen, Republic… Gov't and C…
#> # ℹ 496 more rows
#> # ℹ 3 more variables: regionname <chr>, totalamt <dbl>, `_id.$oid` <chr>

And we can limit the coluns produced by calling dplyr::select after

worldbank %>% spread_all %>% select(regionname, totalamt)
#> # A tbl_json: 500 x 3 tibble with a "JSON" attribute
#>   ..JSON                   regionname                    totalamt
#>   <chr>                    <chr>                            <dbl>
#> 1 "{\"_id\":{\"$oid\":..." Africa                       130000000
#> 2 "{\"_id\":{\"$oid\":..." Middle East and North Africa         0
#> 3 "{\"_id\":{\"$oid\":..." East Asia and Pacific          6060000
#> 4 "{\"_id\":{\"$oid\":..." Middle East and North Africa         0
#> # ℹ 496 more rows

But worldbank also contains arrays, which cannot be naively spread into new columns. We can use gather_object to gather all name-value paris by name, and then json_types to identify the type of JSON stored under each value, and dplyr::count to aggregate across documents:

worldbank %>% gather_object %>% json_types %>% count(name, type)
#> # A tibble: 8 × 3
#>   name                type       n
#>   <chr>               <fct>  <int>
#> 1 _id                 object   500
#> 2 boardapprovaldate   string   500
#> 3 closingdate         string   370
#> 4 countryshortname    string   500
#> 5 majorsector_percent array    500
#> 6 project_name        string   500
#> 7 regionname          string   500
#> 8 totalamt            number   500

It appears that majorsector_percent is an array, and so we can use enter_object to enter into it:

worldbank %>% enter_object(majorsector_percent)
#> # A tbl_json: 500 x 2 tibble with a "JSON" attribute
#>   ..JSON                  document.id
#>   <chr>                         <int>
#> 1 "[{\"Name\":\"Educa..."           1
#> 2 "[{\"Name\":\"Publi..."           2
#> 3 "[{\"Name\":\"Trans..."           3
#> 4 "[{\"Name\":\"Healt..."           4
#> # ℹ 496 more rows

and gather_array to gather it by index

worldbank %>% enter_object(majorsector_percent) %>% gather_array
#> # A tbl_json: 1,405 x 3 tibble with a "JSON" attribute
#>   ..JSON                  document.id array.index
#>   <chr>                         <int>       <int>
#> 1 "{\"Name\":\"Educat..."           1           1
#> 2 "{\"Name\":\"Educat..."           1           2
#> 3 "{\"Name\":\"Public..."           1           3
#> 4 "{\"Name\":\"Educat..."           1           4
#> # ℹ 1,401 more rows

We can then spread_all again to capture the name-value pairs stored in each object

worldbank %>% 
  enter_object(majorsector_percent) %>% gather_array %>% spread_all
#> # A tbl_json: 1,405 x 5 tibble with a "JSON" attribute
#>   ..JSON                  document.id array.index Name                   Percent
#>   <chr>                         <int>       <int> <chr>                    <dbl>
#> 1 "{\"Name\":\"Educat..."           1           1 Education                   46
#> 2 "{\"Name\":\"Educat..."           1           2 Education                   26
#> 3 "{\"Name\":\"Public..."           1           3 Public Administration…      16
#> 4 "{\"Name\":\"Educat..."           1           4 Education                   12
#> # ℹ 1,401 more rows

By combining with our initial top-level spread_all, we can aggregate funding dollars by sector by region:

worldbank %>%
  spread_all %>% select(region = regionname, funding = totalamt) %>%
  enter_object(majorsector_percent) %>% gather_array %>% 
  spread_all %>% rename(sector = Name, percent = Percent) %>%
  group_by(region, sector) %>%
  summarize(funding = sum(funding * percent))
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by region and sector.
#> ℹ Output is grouped by region.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(region, sector))` for per-operation grouping
#>   (`?dplyr::dplyr_by`) instead.
#> # A tibble: 63 × 3
#> # Groups:   region [7]
#>   region sector                                 funding
#>   <chr>  <chr>                                    <dbl>
#> 1 Africa Agriculture, fishing, and forestry 96648500000
#> 2 Africa Education                          83370000000
#> 3 Africa Energy and mining                  86709960000
#> 4 Africa Finance                            30761000000
#> # ℹ 59 more rows

Tidyjson functions

Tidyjson provides the following core functions:

Function	Use	Row Effect	Column Effect	JSON Effect
`spread_all`	Spread all object values		add many columns
`spread_values`	Spread specific columns		specific columns
`gather_array`	Gather a JSON array	Duplicates rows	index column	enter array
`gather_object`	Gather a JSON object	Duplicates rows	name column	enter object
`append_values_X`	Append all values of a type		column of type X
`enter_object`	Enter into an object by name	Keeps rows with name		enter object
`json_types`	Identify JSON type		type column
`json_lengths`	Identify JSON length		length column

These functions can be composed into pipelines using the %>% operator, as each takes a tbl_json object and returns a tbl_json object.

Common patterns

The following are common patterns for manipulating JSON data with tidyjson.

Spread selected

Spread all object values and select a subset to continue working with.

spread_all %>% select

worldbank %>% spread_all %>% select(regionname, totalamt)
#> # A tbl_json: 500 x 3 tibble with a "JSON" attribute
#>   ..JSON                   regionname                    totalamt
#>   <chr>                    <chr>                            <dbl>
#> 1 "{\"_id\":{\"$oid\":..." Africa                       130000000
#> 2 "{\"_id\":{\"$oid\":..." Middle East and North Africa         0
#> 3 "{\"_id\":{\"$oid\":..." East Asia and Pacific          6060000
#> 4 "{\"_id\":{\"$oid\":..." Middle East and North Africa         0
#> # ℹ 496 more rows

Object structure

Identify the names under an object, and their type and frequency.

gather_object %>% json_types %>% count(name, type)

worldbank %>% gather_object %>% json_types %>% count(name, type)
#> # A tibble: 8 × 3
#>   name              type       n
#>   <chr>             <fct>  <int>
#> 1 _id               object   500
#> 2 boardapprovaldate string   500
#> 3 closingdate       string   370
#> 4 countryshortname  string   500
#> # ℹ 4 more rows

Gather nested array

Enter into an array nested under an object, and gather it

enter_object %>% gather_array

worldbank %>% enter_object(majorsector_percent) %>% gather_array
#> # A tbl_json: 1,405 x 3 tibble with a "JSON" attribute
#>   ..JSON                  document.id array.index
#>   <chr>                         <int>       <int>
#> 1 "{\"Name\":\"Educat..."           1           1
#> 2 "{\"Name\":\"Educat..."           1           2
#> 3 "{\"Name\":\"Public..."           1           3
#> 4 "{\"Name\":\"Educat..."           1           4
#> # ℹ 1,401 more rows

Filter for a specific type

Filter to just objects / arrays and then gather them

filter(is_json_X(.)) %>% gather_X

companies[1] %>% gather_object %>% 
  filter(is_json_array(.)) %>% gather_array
#> # A tbl_json: 8 x 4 tibble with a "JSON" attribute
#>   ..JSON                  document.id name           array.index
#>   <chr>                         <int> <chr>                <int>
#> 1 "{\"is_past\":fals..."            1 relationships            1
#> 2 "{\"id\":6252,\"rou..."           1 funding_rounds           1
#> 3 "{\"id\":15229,\"ro..."           1 funding_rounds           2
#> 4 "{\"id\":22565,\"ro..."           1 funding_rounds           3
#> # ℹ 4 more rows

companies[1] %>% gather_object %>% 
  filter(is_json_object(.)) %>% gather_object
#> Warning in gather_object(.): name column name already exists, changing to
#> name.2
#> # A tbl_json: 3 x 4 tibble with a "JSON" attribute
#>   ..JSON                document.id name  name.2         
#>   <chr>                       <int> <chr> <chr>          
#> 1 "\"52cdef7e4bab8b..."           1 _id   $oid           
#> 2 "[[[150,22],\"ass..."           1 image available_sizes
#> 3 "null"                          1 image attribution

Gather and append

Useful when data is stored in object names as well as values

gather_object %>% append_values_X

json <- '{"2015": 5, "2016": 10}'
json %>% gather_object("year") %>% append_values_number("count")
#> # A tbl_json: 2 x 4 tibble with a "JSON" attribute
#>   ..JSON document.id year  count
#>   <chr>        <int> <chr> <dbl>
#> 1 5                1 2015      5
#> 2 10               1 2016     10

Getting started with JSON data

The first step in using tidyjson is to get your data into a tbl_json object. All tidyjson functions automatically coerce objects into tbl_json if they are not already, so you may be able to just start manipulating your data directly.

But if not, you can use as.tbl_json directly. Here are examples for common scenarios:

Character vector

The simplest case is when your JSON data is already in R as a character vector, like the worldbank data:

worldbank %>% as.tbl_json
#> # A tbl_json: 500 x 2 tibble with a "JSON" attribute
#>   ..JSON                   document.id
#>   <chr>                          <int>
#> 1 "{\"_id\":{\"$oid\":..."           1
#> 2 "{\"_id\":{\"$oid\":..."           2
#> 3 "{\"_id\":{\"$oid\":..."           3
#> 4 "{\"_id\":{\"$oid\":..."           4
#> # ℹ 496 more rows

If this generates errors, then likely your JSON data is malformed:

bad_json <- '{"key": "value"'
bad_json %>% as.tbl_json
#> Error in `purrr::map()`:
#> ℹ In index: 1.
#> Caused by error:
#> ! parse error: premature EOF
#>                                        {"key": "value"
#>                      (right here) ------^

tidyjson uses jsonlite::fromJSON to parse the JSON, and so will print out a useful error message.

Single array

Many APIs will return multiple documents in a single array, like the issues data from github.

issues %>% as.tbl_json
#> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#>   ..JSON                  document.id
#>   <chr>                         <int>
#> 1 "[{\"url\":\"https:..."           1

A single call to gather_array makes this data look like the worldbank data:

issues %>% as.tbl_json %>% gather_array
#> # A tbl_json: 30 x 3 tibble with a "JSON" attribute
#>   ..JSON                  document.id array.index
#>   <chr>                         <int>       <int>
#> 1 "{\"url\":\"https:/..."           1           1
#> 2 "{\"url\":\"https:/..."           1           2
#> 3 "{\"url\":\"https:/..."           1           3
#> 4 "{\"url\":\"https:/..."           1           4
#> # ℹ 26 more rows

List

If your JSON is a list of character strings, you can use `purrr::flatten_chr’ to flatten it into a character vector and then proceed as usual:

library(purrr)
list('1', '2') %>% flatten_chr %>% as.tbl_json
#> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
#>   ..JSON document.id
#>   <chr>        <int>
#> 1 1                1
#> 2 2                2

Data frame

If you extracted JSON from a table in a database into a data frame, then likely you already have other columns in the data frame you would like to retain. You can use then json.column argument to as.tbl_json to specify which column contains the JSON of interest:

df <- tibble(id = 1:2, json = list('[1, 2]', '[3, 4]'))
df %>% as.tbl_json(json.column = "json")
#> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
#>   ..JSON          id
#>   <chr>        <int>
#> 1 "\"[1, 2]\""     1
#> 2 "\"[3, 4]\""     2

File

If your JSON is in a file, like in the jsonlines format, then you can use read_json to read the file into a tbl_json object directly.

JSON included in the package

The tidyjson package comes with several JSON example datasets:

commits: commit data for the dplyr repo from github API
issues: issue data for the dplyr repo from github API
worldbank: world bank funded projects from jsonstudio
companies: startup company data from jsonstudio

Each dataset has some example tidyjson queries in help(commits), help(issues), help(worldbank) and help(companies).